[
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849187#comment-16849187
]
Joris Van den Bossche commented on ARROW-1983:
----------------------------------------------
I think so, yes (at least when reading, it returns a single FileMetadata
instance with all row groups).
Besides the "append" operation, we also need a "write" method for such a
FileMetadata instance (I suppose this only needs some work on the Python/Cython
side, since this is just writing a parquet file without actual data, although
I didn't check the C++ side). There is currently a {{write_metadata}}, but that
requires an *arrow* schema, and not a *parquet* schema.
Regarding the public API, I suppose we can modify {{write_metadata}} to also
accept a parquet schema, so that we don't have to add an extra function. That
will need some changes under the hood in {{ParquetWriter}} so that it can accept
a given FileMetadata object.
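To make the "append" and "write" operations concrete, here is a toy model in
plain Python. It is only a sketch: {{ToyRowGroup}}, {{ToyFileMetadata}},
{{append_row_groups}} and {{to_json}} are hypothetical names made up for
illustration and are not the pyarrow API, and JSON merely stands in for the
real Parquet footer format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ToyRowGroup:
    # Hypothetical stand-in for the metadata of one row group.
    file_path: str
    num_rows: int


@dataclass
class ToyFileMetadata:
    # Hypothetical stand-in for a FileMetadata instance: schema plus row groups.
    schema: List[str]
    row_groups: List[ToyRowGroup] = field(default_factory=list)

    def append_row_groups(self, other: "ToyFileMetadata") -> None:
        # "append": merge the row groups of another file into this instance.
        if other.schema != self.schema:
            raise ValueError("cannot append row groups with a different schema")
        self.row_groups.extend(other.row_groups)

    def to_json(self) -> str:
        # "write": serialize the metadata only, without any column data.
        return json.dumps(asdict(self), sort_keys=True)


schema = ["id", "value"]
combined = ToyFileMetadata(schema)
for part in (
    ToyFileMetadata(schema, [ToyRowGroup("part-0.parquet", 100)]),
    ToyFileMetadata(
        schema,
        [ToyRowGroup("part-1.parquet", 50), ToyRowGroup("part-1.parquet", 25)],
    ),
):
    combined.append_row_groups(part)

serialized = combined.to_json()
```

The point of the model is that the combined object carries every row group of
the dataset but no data, which is why writing it should be cheap.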
> [Python] Add ability to write parquet `_metadata` file
> ------------------------------------------------------
>
> Key: ARROW-1983
> URL: https://issues.apache.org/jira/browse/ARROW-1983
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Python
> Reporter: Jim Crist
> Priority: Major
> Labels: beginner, parquet, pull-request-available
> Fix For: 0.14.0
>
> Time Spent: 6h 20m
> Remaining Estimate: 0h
>
> Currently {{pyarrow.parquet}} can only write the {{_common_metadata}} file
> (mostly just schema information). It would be useful to add the ability to
> write a {{_metadata}} file as well. This should include information about
> each row group in the dataset, including summary statistics. Having this
> summary file would allow filtering of row groups without needing to access
> each file beforehand.
> This would require that the user is able to get the written RowGroups out of
> a {{pyarrow.parquet.write_table}} call and then give these objects as a list
> to a new function, which then passes them on as C++ objects to {{parquet-cpp}},
> which generates the respective {{_metadata}} file.
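As a rough illustration of the row-group filtering described in the issue,
the sketch below keeps only the row groups whose min/max statistics overlap a
query range. {{RowGroupStats}} and {{row_groups_matching}} are hypothetical
names for this example, not part of pyarrow, and a single integer column is
assumed for simplicity.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroupStats:
    # Hypothetical summary entry, as a _metadata file might hold per row group.
    file_path: str
    row_group: int
    min_value: int
    max_value: int


def row_groups_matching(
    stats: List[RowGroupStats], lo: int, hi: int
) -> List[RowGroupStats]:
    # Keep row groups whose [min_value, max_value] range overlaps [lo, hi].
    return [s for s in stats if s.max_value >= lo and s.min_value <= hi]


summary = [
    RowGroupStats("part-0.parquet", 0, 0, 99),
    RowGroupStats("part-0.parquet", 1, 100, 199),
    RowGroupStats("part-1.parquet", 0, 200, 299),
]

# A query for values in [150, 220] can skip the first row group entirely.
selected = row_groups_matching(summary, 150, 220)
```

Only the summary entries are consulted here; none of the data files would need
to be opened to decide which row groups to read.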