[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16818343#comment-16818343
 ] 

Pearu Peterson commented on ARROW-1983:
---------------------------------------

Note that the Parquet format has three different metadata structures; see 
[https://github.com/apache/parquet-format#metadata].

The "_metadata" corresponds to `FileMetaData.key_value_metadata` (in the 
parquet-format specification), while the "statistics" (which are of interest 
to Dask, if I understand correctly) correspond to 
`ColumnMetaData.key_value_metadata`.

Yes, Arrow can read all this information and more. My basic questions are:
 # What information needs to be collected? Some information is internal to 
Parquet files and would never be needed, so collecting it would just be a 
waste of space, especially when the datasets become huge (as would be 
expected in Dask applications).
 # Where should this information be gathered for easy and efficient access?
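
To make the first question more concrete, here is a minimal sketch of the least information a summary might need per row group in order to skip row groups without opening the data files. It is only an illustration: `RowGroupSummary`, `may_contain` and `select_row_groups` are hypothetical names, not part of pyarrow or the parquet-format specification, and the file names and statistics are made-up sample values.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass
class RowGroupSummary:
    """Hypothetical per-row-group record (not a pyarrow or Parquet type)."""
    file_path: str        # data file that holds the row group
    row_group_index: int  # position of the row group inside that file
    num_rows: int
    # column name -> (min, max), as would be taken from column statistics
    min_max: Dict[str, Tuple[Any, Any]]


def may_contain(summary: RowGroupSummary, column: str, lo: Any, hi: Any) -> bool:
    """Return False only if the statistics prove no value lies in [lo, hi]."""
    if column not in summary.min_max:
        return True  # no statistics available: the row group cannot be skipped
    col_min, col_max = summary.min_max[column]
    return col_max >= lo and col_min <= hi


def select_row_groups(summaries: List[RowGroupSummary], column: str,
                      lo: Any, hi: Any) -> List[RowGroupSummary]:
    """Keep the row groups that still have to be read for lo <= column <= hi."""
    return [s for s in summaries if may_contain(s, column, lo, hi)]


# Made-up sample summaries for a dataset of two files.
summaries = [
    RowGroupSummary("part.0.parquet", 0, 1000, {"x": (0, 99)}),
    RowGroupSummary("part.0.parquet", 1, 1000, {"x": (100, 199)}),
    RowGroupSummary("part.1.parquet", 0, 500, {}),  # statistics missing
]
selected = select_row_groups(summaries, "x", 150, 300)
```

In this sketch only the file path, the row group index and the per-column min/max are needed for filtering; everything else in the Parquet metadata could be left out of the summary, which is the space concern raised in the first question.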

 

> [Python] Add ability to write parquet `_metadata` file
> ------------------------------------------------------
>
>                 Key: ARROW-1983
>                 URL: https://issues.apache.org/jira/browse/ARROW-1983
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Jim Crist
>            Priority: Major
>              Labels: beginner, parquet
>             Fix For: 0.14.0
>
>
> Currently {{pyarrow.parquet}} can only write the {{_common_metadata}} file 
> (mostly just schema information). It would be useful to add the ability to 
> write a {{_metadata}} file as well. This should include information about 
> each row group in the dataset, including summary statistics. Having this 
> summary file would allow filtering of row groups without needing to access 
> each file beforehand.
> This would require that the user is able to get the written RowGroups out of 
> a {{pyarrow.parquet.write_table}} call and then give these objects as a list 
> to a new function that passes them on as C++ objects to {{parquet-cpp}}, 
> which then generates the respective {{_metadata}} file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
