[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841625#comment-16841625 ]
Wes McKinney commented on ARROW-1983:
-------------------------------------
Right. This is a relatively straightforward C++ function to write – Pearu
had already partially implemented it in one of the patch iterations.
The API would be something like:
{code:java}
Status WriteMultipleMetadata(
    const std::vector<std::shared_ptr<FileMetaData>>& metadatas,
    arrow::io::OutputStream* out);
{code}
Does someone want to write it? (I can do it, but it would be good for
other people to get some experience with the Parquet codebase.) We also need to
make sure that the file path is being set in the metadata; otherwise the
{{_metadata}} file is useless, because a reader cannot tell which data file
each row group lives in.
> [Python] Add ability to write parquet `_metadata` file
> ------------------------------------------------------
>
> Key: ARROW-1983
> URL: https://issues.apache.org/jira/browse/ARROW-1983
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Python
> Reporter: Jim Crist
> Priority: Major
> Labels: beginner, parquet, pull-request-available
> Fix For: 0.14.0
>
> Time Spent: 5h 50m
> Remaining Estimate: 0h
>
> Currently {{pyarrow.parquet}} can only write the {{_common_metadata}} file
> (mostly just schema information). It would be useful to add the ability to
> write a {{_metadata}} file as well. This should include information about
> each row group in the dataset, including summary statistics. Having this
> summary file would allow filtering of row groups without needing to access
> each file beforehand.
> This would require that the user be able to get the written RowGroups out of
> a {{pyarrow.parquet.write_table}} call and then pass these objects as a list
> to a new function, which hands them on as C++ objects to {{parquet-cpp}},
> which in turn generates the corresponding {{_metadata}} file.