[jira] [Comment Edited] (ARROW-18148) [R] Rename read_ipc_file to read_arrow_file & highlight arrow over feather

Danielle Navarro (Jira) Mon, 31 Oct 2022 23:56:25 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-18148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626980#comment-17626980
 ]


Danielle Navarro edited comment on ARROW-18148 at 11/1/22 6:54 AM:
-------------------------------------------------------------------

Tentatively offering some thoughts :-)

If I'm understanding this properly, we have two problems:

- The first problem is that the history of serializing Arrow objects is messy 
and has left us with three names that people might recognize: Feather, IPC, 
Arrow. We'd like users to transition to using "Arrow" as the preferred name, 
and to give them an API that reflects that terminology.

- The second problem is that we use "file format" and "stream format" to mean 
something subtly different from "files" and "streams". The file format wraps 
the stream format with magic numbers at the start and end, with a footer 
written after the stream. These two formats aren't *inherently* tied to files 
and streams. The user can write a "stream formatted" file if they want (i.e., 
no magic numbers, no footers) and they can also send a "file formatted" 
serialization (i.e., with the magic number and footer) to an output stream if 
they want to. The current API allows this, but users would be forgiven for 
missing this subtle detail!
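
To make the second problem concrete, here is a minimal sketch, assuming the `arrow` R package is installed (10.0.0 or later for `read_ipc_file()`). The toy data frame and the temporary path are made up for illustration. It checks for the `ARROW1` magic string in each encoding, then writes a "stream formatted" file to disk and reads a "file formatted" serialization from memory:

```r
library(arrow)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))  # toy data, purely illustrative

# Serialize the same data in both encodings, entirely in memory
file_bytes   <- write_to_raw(df, format = "file")
stream_bytes <- write_to_raw(df, format = "stream")

# The file format starts and ends with the magic string; the stream format does not
magic <- charToRaw("ARROW1")
identical(file_bytes[1:6], magic)       # TRUE
identical(tail(file_bytes, 6), magic)   # TRUE
identical(stream_bytes[1:6], magic)     # FALSE

# A "stream formatted" file: stream encoding, file destination
path <- tempfile(fileext = ".arrows")
write_ipc_stream(df, path)
from_disk <- read_ipc_stream(path)

# A "file formatted" serialization that never touches disk
from_memory <- read_ipc_file(file_bytes)
```

Both `from_disk` and `from_memory` should contain the same data as `df`, even though the encoding and the destination are "mismatched" in each case.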

h2. Option 1: Don't change the API, only the docs

This option would leave `read_ipc_file()`, `write_ipc_file()`, 
`read_ipc_stream()`, and `write_ipc_stream()` as the four user-facing functions 
(treating `read_feather()` and `write_feather()` as soft-deprecated, and 
leaving `write_to_raw()` untouched).

The only thing that would change in this version is that we would consistently 
refer to "Arrow IPC file" and "Arrow IPC stream" everywhere (i.e., never 
truncating it to "IPC"). Language around "feather" would be relegated to a 
secondary position (e.g., "formerly known as Feather"), and we would emphasize 
that the preferred file extension is `.arrow`. 

h2. Option 2: New names for the existing four functions

This option would replace `read_ipc_file()` with `read_arrow_file()`, 
`read_ipc_stream()` with `read_arrow_stream()` and so on. The `ipc` and 
`feather` versions would be soft-deprecated.  

The documentation would be updated accordingly. We'd now refer to "Arrow file" 
and "Arrow stream" everywhere. As with option 1 we'd use language like 
"formerly known as Feather" to explain the history (perhaps linking back to the 
old repo just to highlight the origin). We would also, where relevant, note 
that "Arrow stream" is a conventional name for the "Arrow inter-process 
communication (IPC) streaming format", as a way of (a) explaining the ipc 
versions of the functions, and (b) helping users find the relevant part of the 
Arrow specification.

h2. Option 3: Reduce API to two functions

This option would have only two functions, `read_arrow()` and `write_arrow()`. 
Both functions would have a new argument called `format` (or something 
similar). Users could specify either `format = "stream"` or `format = "file"`. 
From a documentation perspective this would require a little more finessing: we 
might have to separate the help topics for the new API from those for older 
versions of the API to avoid a mess. But it might have the advantage of making clearer to 
users that the terms `"stream"` and `"file"` don't actually refer to *where* 
you're writing the data, but how you *encode* the data when you write it. 
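
As a rough sketch of what Option 3 might look like, the two functions could be thin wrappers around the existing ones. The names `write_arrow_sketch()` and `read_arrow_sketch()` are hypothetical and exist nowhere in the `arrow` package; the `format` argument and its default are assumptions taken from the description above, not a settled design:

```r
library(arrow)

# Hypothetical wrappers: they are not part of the arrow package and only
# illustrate the proposed `format` argument.
write_arrow_sketch <- function(x, sink, format = c("file", "stream"), ...) {
  format <- match.arg(format)
  switch(format,
    file   = write_ipc_file(x, sink, ...),    # magic numbers plus footer
    stream = write_ipc_stream(x, sink, ...)   # bare stream encoding
  )
}

read_arrow_sketch <- function(file, format = c("file", "stream"), ...) {
  format <- match.arg(format)
  switch(format,
    file   = read_ipc_file(file, ...),
    stream = read_ipc_stream(file, ...)
  )
}

# Usage: `format` picks the encoding, `sink` picks the destination
df <- data.frame(x = 1:3)
path <- tempfile(fileext = ".arrow")
write_arrow_sketch(df, path, format = "stream")
roundtrip <- read_arrow_sketch(path, format = "stream")
```

In this sketch the reader has to be told which encoding to expect, which is one of the details a real design would need to settle.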

h2. Preferences?

I am not sure what I prefer, but I can at least say what I think the strengths 
and weaknesses are for each proposal:

Option 3 seems like the cleanest in terms of making the Arrow/Feather/IPC 
functions feel analogous to the other functions in the read/write API: 
`read_arrow()` and `write_arrow()` feel closely aligned with `read_parquet()` 
and `write_parquet()`. It makes very clear that these functions are designed to 
read and write Arrow objects in an "Arrow-like" way. However, it does have the 
disadvantage that the encoding vs destination complexity gets pushed into the 
arguments: users will need to understand why there is a `format` argument that is 
distinct from the `file`/`sink` argument, and the documentation will need to 
explain that. 

Option 2 has the advantage of preserving the same "four-function structure" as 
the existing serialization API, but it does come at the expense of being a 
little misleading to anyone who doesn't understand that the function names 
refer to the encoding not the destination: `write_arrow_stream()` can in fact 
write to a file, and `write_arrow_file()` can write to a stream. That's 
potentially even more confusing. 

Option 1 has the advantage of not confusing existing users. The API doesn't 
change, and the documentation becomes slightly more informative. The 
disadvantage is that it leaves new users a bit confused about what the heck an 
"IPC" is, which means the documentation will have to carry the load. 

h2. Additional documentation thoughts

Regardless of what option we go with, I'll write the user-facing vignettes to 
use only the newest version of the API, especially in the `arrow.Rmd` vignette 
and the `read_write.Rmd` vignette where new users are most likely to run across 
these concepts. In those places I would try my best not to dive into too much 
detail, because it's a complexity that new users don't need. 

The question that arises is "where do we talk about the nuance?" To some extent 
I think we could move some of that to the "details" section of various help 
topics, but... (and I hate saying this)... it might make sense to write an 
"Arrow serialization" vignette that would be loosely analogous to the "Data 
object layout" vignette that I'm proposing to introduce in  
https://github.com/apache/arrow/pull/14514. On the documentation page it would 
be grouped in with the developer vignettes (to signal that it's advanced 
content), but just like I'm doing with "Data object layout", I'll cross 
reference it from the user-facing vignettes. For instance, in the section on 
reading and writing Arrow (formerly Feather) files, there would be a short 
paragraph that hints at these issues, and then links the user to the 
serialization vignette where all the detail is unpacked.

 

 


> [R] Rename read_ipc_file to read_arrow_file & highlight arrow over feather
> --------------------------------------------------------------------------
>
>                 Key: ARROW-18148
>                 URL: https://issues.apache.org/jira/browse/ARROW-18148
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Documentation, R
>            Reporter: Stephanie Hazlitt
>            Priority: Minor
>              Labels: feather
>
> Following up from [this mailing list 
> conversation|https://lists.apache.org/thread/nxncph842h8tyovxp04hrzq4y35lq4xq],
>  I am wondering if the R package should rename `read_ipc_file()` / 
> `write_ipc_file()` to `read_arrow_file()`/ `write_arrow_file()`, or add an 
> additional alias for both. It might also be helpful to update the 
> documentation so that users read "Write an Arrow file (formerly known as a 
> Feather file)" rather than the current Feather-first naming, assuming 
> there is a community decision to coalesce around the name Arrow for the file 
> format, and the project is moving on from the name Feather.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
