[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16820084#comment-16820084 ]
Pearu Peterson commented on ARROW-1983:
---------------------------------------

There seem to be two options for writing a separate metadata file from Arrow:
# Following Wes's comment "On the C++ side we would expose an API to append row group metadata into a common file.", introduce a second sink argument to ParquetFileWriter::Open that would be used to collect the FileMetaData content during writing. [~wesmckinn], can you confirm that this would be the right approach?
# Introduce a flag on ParquetFileWriter that, when enabled, skips all data writes and writes only the FileMetaData content. As a result, one would need to run the dataset write twice: once to write the data (and metadata) as is done currently, and a second time to write metadata only (these writes would be collected into a single file).

Comparing the two approaches, approach 2 is simpler but suboptimal because the writing process is executed twice. In both cases the metadata would be stored twice (in the data files, as currently, and in the separate metadata file). If readers were able to use metadata from a separate file (I am not sure whether the Parquet format allows this), the duplicated metadata storage could be avoided in both approaches.

> [Python] Add ability to write parquet `_metadata` file
> ------------------------------------------------------
>
>                 Key: ARROW-1983
>                 URL: https://issues.apache.org/jira/browse/ARROW-1983
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Jim Crist
>            Priority: Major
>              Labels: beginner, parquet
>             Fix For: 0.14.0
>
>
> Currently {{pyarrow.parquet}} can only write the {{_common_metadata}} file
> (mostly just schema information). It would be useful to add the ability to
> write a {{_metadata}} file as well. This should include information about
> each row group in the dataset, including summary statistics. Having this
> summary file would allow filtering of row groups without needing to access
> each file beforehand.
> This would require that the user is able to get the written RowGroups out of
> a {{pyarrow.parquet.write_table}} call and then give these objects as a list
> to a new function that then passes them on as C++ objects to {{parquet-cpp}},
> which generates the respective {{_metadata}} file.
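
The collect-then-write workflow described in the issue (get per-file row group metadata back from each write, then hand the collected list to a function that produces one summary file) can be sketched with a toy model. This is only an illustration under stated assumptions: it uses JSON and in-memory sinks instead of the Parquet format, and every name in it (write_data_file, write_metadata_file, select_row_groups, the ".toy" paths) is hypothetical and not part of pyarrow or parquet-cpp.

```python
import io
import json


def write_data_file(rows, row_group_size, data_sink, file_path):
    """Write rows to data_sink in row groups and return their metadata.

    Toy stand-in for a writer that hands back row group metadata
    (hypothetical; not the Parquet format and not an Arrow API).
    """
    row_groups = []
    for start in range(0, len(rows), row_group_size):
        chunk = rows[start:start + row_group_size]
        payload = json.dumps(chunk).encode("utf-8")
        offset = data_sink.tell()
        data_sink.write(payload)
        row_groups.append({
            "file_path": file_path,   # where the row group lives
            "offset": offset,         # byte offset inside that file
            "num_bytes": len(payload),
            "num_rows": len(chunk),
            "min": min(chunk),        # summary statistics
            "max": max(chunk),
        })
    return row_groups


def write_metadata_file(collected, metadata_sink):
    """Merge the per-file metadata lists into one summary in metadata_sink."""
    all_groups = [rg for file_groups in collected for rg in file_groups]
    metadata_sink.write(json.dumps({"row_groups": all_groups}).encode("utf-8"))


def select_row_groups(metadata_bytes, value):
    """Use only the summary to find row groups that may contain value."""
    summary = json.loads(metadata_bytes.decode("utf-8"))
    return [rg for rg in summary["row_groups"]
            if rg["min"] <= value <= rg["max"]]


# Write two data files, collecting the metadata of each write.
collected = []
sink_a, sink_b = io.BytesIO(), io.BytesIO()
collected.append(write_data_file([1, 2, 3, 4], 2, sink_a, "part-0.toy"))
collected.append(write_data_file([10, 11, 12], 2, sink_b, "part-1.toy"))

# Produce the single summary file from the collected metadata.
metadata_sink = io.BytesIO()
write_metadata_file(collected, metadata_sink)

# Filter row groups without touching either data file.
hits = select_row_groups(metadata_sink.getvalue(), 11)
print([(rg["file_path"], rg["offset"]) for rg in hits])
```

In this toy model the data is written only once, as in approach 1 of the comment, and the statistics end up both alongside the data and in the summary, which mirrors the duplicated metadata storage noted there.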