[
https://issues.apache.org/jira/browse/ARROW-18148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626980#comment-17626980
]
Danielle Navarro edited comment on ARROW-18148 at 11/1/22 6:54 AM:
-------------------------------------------------------------------
Tentatively offering some thoughts :-)
If I'm understanding this properly, we have two problems:
- The first problem is that the history of serializing Arrow objects is messy
and has left us with three names that people might recognize: Feather, IPC,
Arrow. We'd like users to transition to using "Arrow" as the preferred name,
and to give them an API that reflects that terminology.
- The second problem is that we use "file format" and "stream format" to mean
something subtly different from "files" and "streams". The file format wraps
the stream format with magic numbers at the start and end, with a footer
written after the stream. These two formats aren't *inherently* tied to files
and streams. The user can write a "stream formatted" file if they want (i.e.,
no magic numbers, no footers) and they can also send a "file formatted"
serialization (i.e., with the magic numbers and footer) to an output stream if
they want to. The current API allows this, but users would be forgiven for
missing this subtle detail!
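To make the encoding-versus-destination distinction concrete, here is a minimal sketch using the functions already named in this comment. It assumes the arrow R package is installed, and it relies on the "ARROW1" magic string described in the Arrow IPC file format specification. The data frame and variable names are purely illustrative.

```r
library(arrow)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# Stream format, written to a file on disk.
stream_path <- tempfile()
write_ipc_stream(df, stream_path)
stream_bytes <- readBin(stream_path, what = "raw", n = file.size(stream_path))

# File format, serialized to an in-memory raw vector rather than a file.
file_bytes <- write_to_raw(df, format = "file")

# The file format is bracketed by the magic string; the stream format is not.
magic <- charToRaw("ARROW1")
identical(file_bytes[1:6], magic)
identical(tail(file_bytes, 6), magic)
identical(stream_bytes[1:6], magic)

# Each reader follows the encoding, whatever the source happens to be.
from_stream_format <- read_ipc_stream(stream_path)
from_file_format <- read_ipc_file(file_bytes)
```

The first two comparisons should come back `TRUE` and the third `FALSE`, even though the stream-formatted bytes are the ones sitting in a file.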
h2. Option 1: Don't change the API, only the docs
This option would leave `read_ipc_file()`, `write_ipc_file()`,
`read_ipc_stream()`, and `write_ipc_stream()` as the four user-facing functions
(treating `read_feather()` and `write_feather()` as soft-deprecated, and
leaving `write_to_raw()` untouched).
The only thing that would change in this version is that we would consistently
refer to "Arrow IPC file" and "Arrow IPC stream" everywhere (i.e., never
truncating it to "IPC"). Language around "feather" would be relegated to a
secondary position (e.g., "formerly known as Feather"), and we would emphasize
that the preferred file extension is `.arrow`.
h2. Option 2: New names for the existing four functions
This option would replace `read_ipc_file()` with `read_arrow_file()`,
`read_ipc_stream()` with `read_arrow_stream()` and so on. The `ipc` and
`feather` versions would be soft-deprecated.
The documentation would be updated accordingly. We'd now refer to "Arrow file"
and "Arrow stream" everywhere. As with Option 1, we'd use language like
"formerly known as Feather" to explain the history (perhaps linking back to the
old repo just to highlight the origin). We would also, where relevant, note
that "Arrow stream" is a conventional name for the "Arrow inter-process
communication (IPC) streaming format", as a way of (a) explaining the `ipc`
versions of the functions, and (b) helping users find the relevant part of the
Arrow specification.
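As a rough sketch of what Option 2 could look like from the user's side, the new names can be mocked up as plain aliases of the existing functions. The four `*_arrow_*` names below are the hypothetical ones proposed here and are defined locally for illustration. The soft-deprecation messaging for the `ipc` and `feather` versions is not shown.

```r
library(arrow)

# Hypothetical Option 2 names, mocked up as aliases of the existing functions.
read_arrow_file <- read_ipc_file
write_arrow_file <- write_ipc_file
read_arrow_stream <- read_ipc_stream
write_arrow_stream <- write_ipc_stream

# The name describes the encoding, not the destination: this call writes a
# stream-formatted serialization to a file on disk.
path <- tempfile()
write_arrow_stream(data.frame(x = 1:3), path)
round_trip <- read_arrow_stream(path)
```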
h2. Option 3: Reduce API to two functions
This option would have only two functions, `read_arrow()` and `write_arrow()`.
Both functions would have a new argument called `format` (or something
similar). Users could specify either `format = "stream"` or `format = "file"`.
From a documentation perspective this would require a little more finessing: we
might have to separate the help topics for the new API and older versions of
the API to avoid a mess. But it might have the advantage of making clearer to
users that the terms `"stream"` and `"file"` don't actually refer to *where*
you're writing the data, but how you *encode* the data when you write it.
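A minimal sketch of how Option 3 might dispatch, assuming the two functions simply wrap the existing four. The definitions below are hypothetical and local to this example, and defaulting `format` to `"file"` is an assumption rather than part of the proposal.

```r
library(arrow)

# Hypothetical Option 3 wrappers: `format` selects the encoding, while
# `sink` / `file` selects where the bytes go or come from.
write_arrow <- function(x, sink, format = c("file", "stream"), ...) {
  format <- match.arg(format)
  switch(format,
    file = write_ipc_file(x, sink, ...),
    stream = write_ipc_stream(x, sink, ...)
  )
}

read_arrow <- function(file, format = c("file", "stream"), ...) {
  format <- match.arg(format)
  switch(format,
    file = read_ipc_file(file, ...),
    stream = read_ipc_stream(file, ...)
  )
}

# Both calls write to a path on disk; only the encoding differs.
df <- data.frame(x = 1:3)
file_path <- tempfile()
stream_path <- tempfile()
write_arrow(df, file_path, format = "file")
write_arrow(df, stream_path, format = "stream")

from_file_format <- read_arrow(file_path, format = "file")
from_stream_format <- read_arrow(stream_path, format = "stream")
```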
h2. Preferences?
I am not sure what I prefer, but I can at least say what I think the strengths
and weaknesses are for each proposal:
Option 3 seems like the cleanest in terms of making the Arrow/Feather/IPC
functions feel analogous to the other functions in the read/write API:
`read_arrow()` and `write_arrow()` feel closely aligned with `read_parquet()`
and `write_parquet()`. It makes very clear that these functions are designed to
read and write Arrow objects in an "Arrow-like" way. However, it does have the
disadvantage that the encoding vs destination complexity gets pushed into the
arguments: users will need to understand why there is a `format` argument that is
distinct from the `file`/`sink` argument, and the documentation will need to
explain that.
Option 2 has the advantage of preserving the same "four-function structure" as
the existing serialization API, but it does come at the expense of being a
little misleading to anyone who doesn't understand that the function names
refer to the encoding, not the destination: `write_arrow_stream()` can in fact
write to a file, and `write_arrow_file()` can write to a stream. That's
potentially even more confusing.
Option 1 has the advantage of not confusing existing users. The API doesn't
change, and the documentation becomes slightly more informative. The
disadvantage is that it leaves new users a bit confused about what the heck an
"IPC" is, which means the documentation will have to carry the load.
h2. Additional documentation thoughts
Regardless of what option we go with, I'll write the user-facing vignettes to
use only the newest version of the API, especially in the `arrow.Rmd` vignette
and the `read_write.Rmd` vignette where new users are most likely to run across
these concepts. In those places I would try my best not to dive into too much
detail, because it's a complexity that new users don't need.
The question that arises is "where do we talk about the nuance?" To some extent
I think we could move some of that to the "details" section of various help
topics, but... (and I hate saying this)... it might make sense to write an
"Arrow serialization" vignette that would be loosely analogous to the "Data
object layout" vignette that I'm proposing to introduce in
https://github.com/apache/arrow/pull/14514. On the documentation page it would
be grouped in with the developer vignettes (to signal that it's advanced
content), but just like I'm doing with "Data object layout", I'll cross
reference it from the user-facing vignettes. For instance, in the section on
reading and writing Arrow (formerly Feather) files, there would be a short
paragraph that hints at these issues, and then links the user to the
serialization vignette where all the detail is unpacked.
> [R] Rename read_ipc_file to read_arrow_file & highlight arrow over feather
> --------------------------------------------------------------------------
>
> Key: ARROW-18148
> URL: https://issues.apache.org/jira/browse/ARROW-18148
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Documentation, R
> Reporter: Stephanie Hazlitt
> Priority: Minor
> Labels: feather
>
> Following up from [this mailing list
> conversation|https://lists.apache.org/thread/nxncph842h8tyovxp04hrzq4y35lq4xq],
> I am wondering if the R package should rename `read_ipc_file()` /
> `write_ipc_file()` to `read_arrow_file()` / `write_arrow_file()`, or add an
> additional alias for both. It might also be helpful to update the
> documentation so that users read "Write an Arrow file (formerly known as a
> Feather file)" rather than the current Feather-first naming, assuming
> there is a community decision to coalesce around the name Arrow for the file
> format, and the project is moving on from the name Feather.