[ https://issues.apache.org/jira/browse/ARROW-8733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17101909#comment-17101909 ]
Francois Saint-Jacques commented on ARROW-8733:
-----------------------------------------------

We could expose this, yes. In ARROW-8062, I'm constructing the ParquetFileFragment without holding the original FileMetaData for multiple reasons, but most importantly because it may or may not map to what the real physical FileMetaData holds. I think it would be ill-conceived to create a fake FileMetaData constructed from the `_metadata` file.

I plan to expose the statistics, probably via shared_ptr<Expression>. There are various details to flesh out regarding this, especially the case where a fragment contains only a strict subset of the row groups.

What about serialization for dask? That's another potential issue.

> [C++][Dataset][Python] ParquetFileFragment should provide access to parquet
> FileMetadata
> ----------------------------------------------------------------------------------------
>
>                 Key: ARROW-8733
>                 URL: https://issues.apache.org/jira/browse/ARROW-8733
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset
>             Fix For: 1.0.0
>
>
> Related to ARROW-8062 (as there we will also need a way to expose the global
> FileMetadata). But independently, it would be useful to get access to the
> FileMetadata on each {{ParquetFileFragment}} (e.g. to get access to the
> statistics).
> This would be relatively simple to code on the Python/R side, since we have
> access to the file path, and could read the metadata from the file backing
> the fragment, and return this as a FileMetadata object.
> I am wondering if we want to integrate this with ARROW-8062, since when the
> fragments were created from a {{_metadata}} file, a
> {{ParquetFileFragment.metadata}} attribute would not need to read it from the
> parquet file in this case, but from the global metadata (at least for e.g. the
> row group data).
> Another question: what about a ParquetFileFragment that maps to a single row
> group?
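
To illustrate the concern raised above about a fragment that covers only a strict subset of a file's row groups, here is a minimal sketch in plain Python. It is not Arrow code: `RowGroupStats`, `FragmentView`, and `fragment_bounds` are hypothetical names invented for this example, and the numbers are made up. It only shows why statistics taken from the whole-file metadata can differ from what a fragment actually contains.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class RowGroupStats:
    """Hypothetical stand-in for one row group's statistics on a single column."""
    num_rows: int
    min_value: int
    max_value: int


@dataclass(frozen=True)
class FragmentView:
    """Hypothetical fragment: a file path plus the row-group indices it covers."""
    path: str
    row_groups: Tuple[int, ...]


def fragment_bounds(
    file_stats: Sequence[RowGroupStats], fragment: FragmentView
) -> Tuple[int, int, int]:
    """Return (num_rows, min, max) using only the row groups the fragment covers."""
    selected = [file_stats[i] for i in fragment.row_groups]
    num_rows = sum(rg.num_rows for rg in selected)
    lowest = min(rg.min_value for rg in selected)
    highest = max(rg.max_value for rg in selected)
    return num_rows, lowest, highest


# Made-up statistics for a file with three row groups.
file_stats = [
    RowGroupStats(num_rows=100, min_value=0, max_value=9),
    RowGroupStats(num_rows=100, min_value=10, max_value=19),
    RowGroupStats(num_rows=50, min_value=20, max_value=24),
]

whole_file = FragmentView("part-0.parquet", (0, 1, 2))
single_group = FragmentView("part-0.parquet", (1,))

whole_bounds = fragment_bounds(file_stats, whole_file)
subset_bounds = fragment_bounds(file_stats, single_group)
```

In this sketch the single-row-group fragment has tighter bounds and fewer rows than the file it lives in, so handing back the file-level metadata unchanged would describe data the fragment does not hold.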
-- This message was sent by Atlassian Jira (v8.3.4#803005)