[
https://issues.apache.org/jira/browse/ARROW-14904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452558#comment-17452558
]
Weston Pace commented on ARROW-14904:
-------------------------------------
I'm a little bit torn here. Append is definitely something that users want.
It is asked for a lot[1][2][3] (this is just a sample; there are at least 5
variations of "how do I append to parquet", and some on the mailing list too).
But the answer is very confusing to users. The parquet format page has
confused many people with this line:
{quote}The format is explicitly designed to separate the metadata from the
data. This allows splitting columns into multiple files, as well as having a
single metadata file reference multiple parquet files.
{quote}
Spark further confuses the picture with "SaveMode.Append" which is documented
as:
{quote}Append mode means that when saving a DataFrame to a data source, if
data/table already exists, contents of the DataFrame are expected to be
appended to existing data.
{quote}
But what actually happens is that it either reads in the file and rewrites
it, or creates a new file in the same "dataset" (I don't recall off the top
of my head which of these two it is).
So it has been useful for me to be able to parrot a simple line "No. You
cannot append to an existing file. The preferred operation is to create a new
file in the same dataset. If you are doing many small writes then you can
concatenate them in memory or you can periodically merge files after they are
written".
So I guess I worry about the slippery slope. "Users might sometimes want to
append data so let's add that to the filesystem" leads to "Users want to be able
to append to CSV files" leads to "We should add an append mode to write_dataset
since there is at least one format that supports it" which leads to further
confusing users.
I won't stand in the way of adding append to CSV if wanted but I would be
pretty stubborn about adding append to write_dataset.
[1] https://stackoverflow.com/questions/44608076/can-you-append-to-a-feather-format
[2] https://stackoverflow.com/questions/47113813/using-pyarrow-how-do-you-append-to-parquet-file
[3] https://stackoverflow.com/questions/38793170/appending-to-orc-file
> [C++] Enable CSV Writer to append / overwrite existing file
> -----------------------------------------------------------
>
> Key: ARROW-14904
> URL: https://issues.apache.org/jira/browse/ARROW-14904
> Project: Apache Arrow
> Issue Type: Sub-task
> Components: C++
> Reporter: Dragoș Moldovan-Grünfeld
> Priority: Major
> Labels: good-first-issue
>
> This would be a match for the {{readr::write_csv()}} {{append}} argument:
> boolean. If {{FALSE}} will overwrite existing file. If {{TRUE}} will append
> to existing file. In both cases, if the file doesn't exist, a new file is
> created.
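The semantics requested in the issue description above can be sketched with the Python standard library alone. This is not the Arrow C++ CSV writer or its API; {{write_csv}} here is a hypothetical helper, and the choice to skip the header row when appending to a non-empty file is an assumption made for this sketch.

```python
import csv
import os


def write_csv(rows, header, path, append=False):
    """Hypothetical helper mirroring the requested append/overwrite behaviour.

    append=False: overwrite the file if it exists.
    append=True:  add rows to the end of the file if it exists.
    In both cases a missing file is created.
    """
    # Assumption: only emit the header when starting a fresh file.
    appending_to_data = (
        append and os.path.exists(path) and os.path.getsize(path) > 0
    )
    mode = "a" if append else "w"
    with open(path, mode, newline="") as f:
        writer = csv.writer(f)
        if not appending_to_data:
            writer.writerow(header)
        writer.writerows(rows)
```

Unlike Parquet, a CSV file has no footer or trailing metadata, so adding rows at the end leaves a valid file, which is why append is plausible for this format in particular.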