[
https://issues.apache.org/jira/browse/ARROW-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125920#comment-17125920
]
Joris Van den Bossche commented on ARROW-2079:
----------------------------------------------
So in practice, when using the factory to discover a partitioned Parquet
dataset, the current default is to "infer" the schema by reading it from the
first file only (optionally, you can have it read the schemas of all files and
build a "unified" schema from those, or pass an explicit schema).
So what are the concrete TODO items that we want to change here related to the
discussion in this issue?
If there is a {{_common_metadata}} file present, read the schema from it
instead of from the "first" file (and I suppose that's all that we would do
with this {{_common_metadata}} file?).
And is the advantage then that this {{_common_metadata}} file is a smaller file
than the "first" data file?
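To make the proposed fallback concrete, here is a minimal sketch using only the standard library. The helper name `pick_schema_source` is hypothetical and not part of pyarrow. The `*.parquet` glob and the skipping of `_`- and `.`-prefixed files are assumptions of this sketch, not a description of what the factory does today.

```python
from pathlib import Path
from typing import Optional


def pick_schema_source(dataset_dir: str) -> Optional[Path]:
    """Return the file the schema would be read from.

    Hypothetical helper (not a pyarrow API): prefer `_common_metadata`
    when it exists, otherwise fall back to the "first" data file.
    """
    root = Path(dataset_dir)
    common = root / "_common_metadata"
    if common.is_file():
        return common

    # Assumption: data files end in .parquet, and files whose names start
    # with "_" or "." are not data files. Sorting makes "first" deterministic.
    data_files = sorted(
        p for p in root.rglob("*.parquet")
        if not p.name.startswith(("_", "."))
    )
    return data_files[0] if data_files else None
```

Either way only one file footer is read, so any gain would come from `_common_metadata` being cheaper to open than a data file.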
> [Python][C++] Possibly use `_common_metadata` for schema if `_metadata` isn't
> available
> ---------------------------------------------------------------------------------------
>
> Key: ARROW-2079
> URL: https://issues.apache.org/jira/browse/ARROW-2079
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Jim Crist
> Priority: Minor
> Labels: dataset, dataset-parquet-read, parquet
>
> Currently pyarrow's parquet writer only writes `_common_metadata` and not
> `_metadata`. From what I understand these are intended to contain the dataset
> schema but not any row group information.
>
> A few (possibly naive) questions:
>
> 1. In the `__init__` for `ParquetDataset`, the following lines exist:
> {code:python}
> if self.metadata_path is not None:
>     with self.fs.open(self.metadata_path) as f:
>         self.common_metadata = ParquetFile(f).metadata
> else:
>     self.common_metadata = None
> {code}
> I believe this should use `common_metadata_path` instead of `metadata_path`,
> as the latter is never written by `pyarrow`, and is given by the `_metadata`
> file instead of `_common_metadata` (as seemingly intended?).
>
> 2. In `validate_schemas` I believe an option should exist for using the
> schema from `_common_metadata` instead of `_metadata`, as pyarrow currently
> only writes the former, and as far as I can tell `_common_metadata` does
> include all the schema information needed.
>
> Perhaps the logic in `validate_schemas` could be ported over to:
>
> {code:python}
> if self.schema is not None:
>     pass  # schema explicitly provided
> elif self.metadata is not None:
>     self.schema = self.metadata.schema
> elif self.common_metadata is not None:
>     self.schema = self.common_metadata.schema
> else:
>     self.schema = self.pieces[0].get_metadata(open_file).schema
> {code}
> If these changes are valid, I'd be happy to submit a PR. It's not 100% clear
> to me the difference between `_common_metadata` and `_metadata`, but I
> believe the schema in both should be the same. Figured I'd open this for
> discussion.
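The precedence proposed in the quoted description can be sketched as a standalone function. This is only an illustration under stated assumptions: `resolve_schema` and its parameters are hypothetical names, the schemas are treated as opaque objects, and the first piece is read through a callback so that no file is opened unless every other source is missing.

```python
from typing import Callable, Optional, Tuple


def resolve_schema(
    explicit: Optional[object],
    metadata_schema: Optional[object],
    common_metadata_schema: Optional[object],
    read_first_piece_schema: Callable[[], object],
) -> Tuple[str, object]:
    """Pick a schema using the precedence from the issue description.

    Hypothetical helper (not a pyarrow API). Returns a label naming the
    source that was used, together with the schema itself.
    """
    if explicit is not None:
        return "explicit", explicit
    if metadata_schema is not None:
        return "_metadata", metadata_schema
    if common_metadata_schema is not None:
        return "_common_metadata", common_metadata_schema
    # Last resort: read the footer of the first data file, lazily.
    return "first piece", read_first_piece_schema()
```

For example, `resolve_schema(None, None, "s", lambda: "t")` returns `("_common_metadata", "s")` without calling the callback.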