[jira] [Commented] (ARROW-14904) [C++] Enable CSV Writer to append / overwrite existing file

Weston Pace (Jira) Thu, 02 Dec 2021 10:50:06 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-14904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452558#comment-17452558
 ]


Weston Pace commented on ARROW-14904:
-------------------------------------

I'm a little bit torn here.  Append is definitely something that users want.  
It is asked for a lot [1][2][3] (this is just a sample; there are at least 5 
variations of "how do I append to parquet", and some on the mailing list too).

But the answer is very confusing to users.  The Parquet format page has 
confused many people with this line:

{quote}The format is explicitly designed to separate the metadata from the 
data. This allows splitting columns into multiple files, as well as having a 
single metadata file reference multiple parquet files. 
{quote}

Spark further confuses the picture with "SaveMode.Append" which is documented 
as:

{quote}Append mode means that when saving a DataFrame to a data source, if 
data/table already exists, contents of the DataFrame are expected to be 
appended to existing data.
{quote}

But what actually happens is that Spark either reads in the file and rewrites 
it, or creates a new file in the same "dataset" (I don't recall off the top of 
my head which of the two it is).

So it has been useful for me to be able to parrot a simple line "No.  You 
cannot append to an existing file.  The preferred operation is to create a new 
file in the same dataset.  If you are doing many small writes then you can 
concatenate them in memory or you can periodically merge files after they are 
written".
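As a rough illustration of that advice, here is a minimal sketch of "create a 
new file in the same dataset, then merge periodically".  It uses plain CSV 
files and only the Python standard library as a stand-in for a real columnar 
format; the function names (write_batch, read_dataset, compact) and the 
"part-*.csv" naming scheme are hypothetical and are not part of any Arrow API.

```python
import csv
import uuid
from pathlib import Path


def write_batch(dataset_dir, rows, header):
    """Add rows by creating a NEW file; existing files are never modified."""
    dataset_dir = Path(dataset_dir)
    dataset_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: one uniquely named part file per write.
    path = dataset_dir / f"part-{uuid.uuid4().hex}.csv"
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return path


def read_dataset(dataset_dir):
    """Read every part file and return all rows as a list of dicts."""
    rows = []
    for path in sorted(Path(dataset_dir).glob("part-*.csv")):
        with path.open(newline="") as f:
            rows.extend(csv.DictReader(f))
    return rows


def compact(dataset_dir, header):
    """Merge all current part files into one new file, then delete the old ones."""
    dataset_dir = Path(dataset_dir)
    old_parts = sorted(dataset_dir.glob("part-*.csv"))
    rows = [[r[h] for h in header] for r in read_dataset(dataset_dir)]
    merged = write_batch(dataset_dir, rows, header)
    # old_parts was collected before the merged file existed, so it is kept.
    for part in old_parts:
        part.unlink()
    return merged
```

Each call to write_batch leaves earlier files untouched, and compact can be 
run occasionally when many small files have accumulated.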

So I guess I worry about the slippery slope.  "Users might sometimes want to 
append data, so let's add that to the filesystem" leads to "Users want to be 
able to append to CSV files", which leads to "We should add an append mode to 
write_dataset since there is at least one format that supports it", which 
leads to further confusing users.

I won't stand in the way of adding append to CSV if it is wanted, but I would 
be pretty stubborn about adding append to write_dataset.

[1] https://stackoverflow.com/questions/44608076/can-you-append-to-a-feather-format
[2] https://stackoverflow.com/questions/47113813/using-pyarrow-how-do-you-append-to-parquet-file
[3] https://stackoverflow.com/questions/38793170/appending-to-orc-file

> [C++] Enable CSV Writer to append / overwrite existing file
> -----------------------------------------------------------
>
>                 Key: ARROW-14904
>                 URL: https://issues.apache.org/jira/browse/ARROW-14904
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: C++
>            Reporter: Dragoș Moldovan-Grünfeld
>            Priority: Major
>              Labels: good-first-issue
>
> This would be a match for the {{readr::write_csv()}} {{append}} argument: 
> boolean. If {{FALSE}} will overwrite existing file. If {{TRUE}} will append 
> to existing file. In both cases, if the file doesn't exist, a new file is 
> created. 
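
For reference, the requested semantics can be sketched in a few lines of 
Python using only the standard library.  This is not the Arrow C++ CSV writer 
and not readr itself; the write_csv function below is a hypothetical 
stand-in.  The header handling (write the header only when the target file is 
new or empty) is my own assumption, since the issue description does not say 
what should happen to the header on append.

```python
import csv
from pathlib import Path


def write_csv(path, header, rows, append=False):
    """Write rows to a CSV file, overwriting or appending.

    append=False: overwrite an existing file.
    append=True:  add rows to the end of an existing file.
    In both cases the file is created if it does not exist.
    """
    path = Path(path)
    # Assumption: skip the header only when appending to a non-empty file.
    has_content = path.exists() and path.stat().st_size > 0
    write_header = not (append and has_content)
    mode = "a" if append else "w"
    with path.open(mode, newline="") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(header)
        writer.writerows(rows)
```

Opening with mode "w" truncates an existing file, while mode "a" positions 
writes at the end; both modes create the file when it is missing.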



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
