[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16820084#comment-16820084
 ] 

Pearu Peterson commented on ARROW-1983:
---------------------------------------

There seem to be two options for writing a separate metadata file from Arrow:
 # Following Wes's comment "On the C++ side we would expose an API to append row 
group metadata into a common file.", introduce a second sink argument to 
ParquetFileWriter::Open that will be used for collecting FileMetaData content 
during writing. [~wesmckinn], can you confirm that this would be the right 
approach?
 # Introduce a flag to ParquetFileWriter that, when enabled, skips all data 
writes and writes only FileMetaData content. As a result, one would need to 
call the dataset write twice: once for writing data (and metadata) as 
currently, and a second time for writing metadata only (these writes would be 
collected into a single file).
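
To make the difference between the two options concrete, here is a minimal toy model in plain Python. It is only a sketch: ToyWriter, RowGroupMeta, metadata_sink and metadata_only are hypothetical names invented for illustration, and none of this is the parquet-cpp or pyarrow API. It simply counts how often data would be serialized and what ends up in the shared metadata collection.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class RowGroupMeta:
    """Toy stand-in for per-row-group metadata (hypothetical, not a Parquet type)."""
    file_path: str
    num_rows: int
    min_value: int
    max_value: int


@dataclass
class ToyWriter:
    """Toy stand-in for a file writer; every name here is an assumption."""
    file_path: str
    metadata_sink: Optional[List[RowGroupMeta]] = None  # option 1: second sink
    metadata_only: bool = False                         # option 2: skip data writes
    data_writes: int = 0
    footer: List[RowGroupMeta] = field(default_factory=list)

    def write_row_group(self, values: List[int]) -> None:
        meta = RowGroupMeta(self.file_path, len(values), min(values), max(values))
        if not self.metadata_only:
            self.data_writes += 1            # pretend to serialize the data
        self.footer.append(meta)             # metadata kept per file
        if self.metadata_sink is not None:
            self.metadata_sink.append(meta)  # also collected in the shared sink


def write_dataset(files: Dict[str, List[List[int]]],
                  sink: Optional[List[RowGroupMeta]] = None,
                  metadata_only: bool = False) -> int:
    """Walk every file's row groups; return the number of data writes performed."""
    total = 0
    for path, row_groups in files.items():
        writer = ToyWriter(path, metadata_sink=sink, metadata_only=metadata_only)
        for values in row_groups:
            writer.write_row_group(values)
        total += writer.data_writes
    return total


files = {"part-0.parquet": [[1, 5, 3], [7, 9]], "part-1.parquet": [[10, 20]]}

# Option 1: a single pass; metadata is collected through the second sink.
collected_1: List[RowGroupMeta] = []
data_writes_1 = write_dataset(files, sink=collected_1)

# Option 2: first pass writes data as today, second pass is metadata-only.
data_writes_2 = write_dataset(files)
collected_2: List[RowGroupMeta] = []
skipped_writes = write_dataset(files, sink=collected_2, metadata_only=True)
```

In this model both options end up with the same collected row group metadata, but option 2 has to walk the whole dataset a second time to get it.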

Comparing the two approaches, approach 2 is simpler but suboptimal because the 
writing process is executed twice. In both cases, the metadata would be stored 
twice (in the data files, as currently, and in the separate metadata file). If 
readers were able to use metadata from a separate file (not sure whether the 
Parquet format allows it), this duplication could be avoided in both 
approaches.

> [Python] Add ability to write parquet `_metadata` file
> ------------------------------------------------------
>
>                 Key: ARROW-1983
>                 URL: https://issues.apache.org/jira/browse/ARROW-1983
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Jim Crist
>            Priority: Major
>              Labels: beginner, parquet
>             Fix For: 0.14.0
>
>
> Currently {{pyarrow.parquet}} can only write the {{_common_metadata}} file 
> (mostly just schema information). It would be useful to add the ability to 
> write a {{_metadata}} file as well. This should include information about 
> each row group in the dataset, including summary statistics. Having this 
> summary file would allow filtering of row groups without needing to access 
> each file beforehand.
> This would require that the user is able to get the written RowGroups out of 
> a {{pyarrow.parquet.write_table}} call and then give these objects as a list 
> to a new function that passes them on as C++ objects to {{parquet-cpp}}, 
> which then generates the respective {{_metadata}} file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
