Norbert created ARROW-9959:
------------------------------

             Summary: [Python][C++][Parquet] Ability to delete row groups from metadata
                 Key: ARROW-9959
                 URL: https://issues.apache.org/jira/browse/ARROW-9959
             Project: Apache Arrow
          Issue Type: Improvement
            Reporter: Norbert


Hi,

We currently use PyArrow to maintain a partitioned dataset of Parquet files on 
disk. We also manage our own `_metadata` file - when new rows are written to 
the dataset, we use the `metadata_collector` argument of `write_to_dataset` to 
collect the metadata written to each individual file. We then load the 
existing `_metadata` file, merge it with the newly written metadata objects 
using `metadata.append_row_groups` (as in the docs), and write the result 
back to `_metadata` on disk.

However, we would also like to occasionally amend this dataset by deleting 
individual files. In order to keep the `_metadata` file in sync, we would need 
to load the metadata of all the files we want to delete, then find their 
row groups inside `_metadata` and remove them. Therefore we require a method 
such as `delete_row_groups` to exist on the `FileMetaData` object. Would it be 
possible for PyArrow to support this? Another way of accomplishing the same 
thing would be to initialise an empty `FileMetaData` object and simply use 
`append_row_groups` to add back all the row groups that are required. However, 
I've been unable to accomplish this programmatically, as the constructor for 
`FileMetaData` seems to ask for a C structure which I'm not sure how to 
construct.

Many thanks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)