assignUser commented on issue #14834: URL: https://github.com/apache/arrow/issues/14834#issuecomment-1337264233
No, you cannot update or insert into an existing parquet file; parquet files are immutable. This is a restriction inherent to the parquet format, not to pyarrow. (The spec theoretically supports appending, but no library implements it; [details](https://stackoverflow.com/a/74206625/19933286).) So to update an existing parquet file you have to read the existing data into memory, add the new data, and write the result back to disk as a new file (with the same name); see the first sketch below.

You can use partitioning to add/append new data to a multi-file parquet dataset by adding new files or overwriting only small partitions; see the second sketch below. See the pyarrow [docs](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html#pyarrow.dataset.write_dataset) for `existing_data_behavior`:

> This behavior, in combination with a unique basename_template for each write, will allow for an append workflow.
>
> ‘delete_matching’ is useful when you are writing a partitioned dataset. The first time each partition directory is encountered the entire directory will be deleted. This allows you to overwrite old partitions completely.

I have opened https://github.com/apache/arrow-cookbook/issues/278 to add an example of this to the [python cookbook](https://arrow.apache.org/cookbook/py/index.html).

I am not quite sure I understand your second question.
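Roughly, the read-modify-write workflow for a single file looks like this. A minimal sketch with a hypothetical path and hypothetical columns; the new rows must match the existing file's schema:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical path and columns, purely for illustration.
path = "data.parquet"
pq.write_table(pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]}), path)

# "Appending" means reading everything back, concatenating, and rewriting.
existing = pq.read_table(path)
new_rows = pa.table({"id": [4, 5], "value": ["d", "e"]})  # same schema as the file
combined = pa.concat_tables([existing, new_rows])

# Overwrite the old file with the combined data under the same name.
pq.write_table(combined, path)
```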
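And this is roughly the dataset append workflow the quoted docs describe; again a sketch, with a hypothetical dataset directory and partition column:

```python
import uuid
import pyarrow as pa
import pyarrow.dataset as ds

# Hypothetical partition column ("year") and dataset directory.
table = pa.table({"year": [2022, 2022, 2023], "value": [1.0, 2.0, 3.0]})

# A unique basename_template per write means new files never clobber
# files from earlier writes, so repeated calls append to the dataset.
ds.write_dataset(
    table,
    "my_dataset",
    format="parquet",
    partitioning=["year"],
    partitioning_flavor="hive",
    basename_template=f"part-{uuid.uuid4().hex}-{{i}}.parquet",
    existing_data_behavior="overwrite_or_ignore",
)
```

Swapping in `existing_data_behavior="delete_matching"` instead deletes each partition directory the first time a write touches it, so old partitions are overwritten wholesale rather than appended to.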
