Norbert created ARROW-9959:
------------------------------
Summary: [Python][C++][Parquet] Ability to delete row groups from
metadata
Key: ARROW-9959
URL: https://issues.apache.org/jira/browse/ARROW-9959
Project: Apache Arrow
Issue Type: Improvement
Reporter: Norbert
Hi,
We currently use PyArrow to maintain a partitioned dataset of Parquet files on
disk. We also manage our own `_metadata` file - when new rows are written to
the dataset, we use the `metadata_collector` argument of `write_to_dataset` to
collect all metadata that was written inside individual files. We then load the
existing `_metadata` file and merge it with all the newly written metadata
using `metadata.append_row_groups` (as in the docs) and then write the result
to `_metadata` on disk.
However, we would also like to occasionally amend this dataset by deleting
individual files. In order to keep the `_metadata` file in sync, we would need
to load the metadata of all the files we want to delete, then find their
row groups inside `_metadata` and remove them. Therefore we require a method
such as `delete_row_groups` to exist on the `FileMetaData` object. Would it be
possible for PyArrow to support this? Another way of accomplishing the same
thing would be to initialise an empty `FileMetaData` object and simply use
`append_row_groups` to add back all the row groups that are required. However,
I've been unable to accomplish this programmatically as the constructor for
`FileMetaData` seems to ask for a C structure which I'm not sure how to
construct.
Many thanks.