[
https://issues.apache.org/jira/browse/ARROW-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wes McKinney updated ARROW-2079:
--------------------------------
Summary: [Python] Possibly use `_common_metadata` for schema if `_metadata`
isn't available (was: Possibly use `_common_metadata` for schema if
`_metadata` isn't available)
> [Python] Possibly use `_common_metadata` for schema if `_metadata` isn't
> available
> ----------------------------------------------------------------------------------
>
> Key: ARROW-2079
> URL: https://issues.apache.org/jira/browse/ARROW-2079
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Jim Crist
> Priority: Minor
> Labels: parquet
>
> Currently pyarrow's parquet writer only writes `_common_metadata` and not
> `_metadata`. From what I understand these are intended to contain the dataset
> schema but not any row group information.
>
> A few (possibly naive) questions:
>
> 1. In the `__init__` for `ParquetDataset`, the following lines exist:
> {code:java}
> if self.metadata_path is not None:
>     with self.fs.open(self.metadata_path) as f:
>         self.common_metadata = ParquetFile(f).metadata
> else:
>     self.common_metadata = None
> {code}
> I believe this should use `common_metadata_path` instead of `metadata_path`.
> The latter points to the `_metadata` file, which `pyarrow` never writes,
> rather than to `_common_metadata` (as seemingly intended?).
>
> 2. In `validate_schemas` I believe an option should exist for using the
> schema from `_common_metadata` instead of `_metadata`, as pyarrow currently
> only writes the former, and as far as I can tell `_common_metadata` does
> include all the schema information needed.
>
> Perhaps the logic in `validate_schemas` could be ported over to:
>
> {code:java}
> if self.schema is not None:
>     pass  # schema explicitly provided
> elif self.metadata is not None:
>     self.schema = self.metadata.schema
> elif self.common_metadata is not None:
>     self.schema = self.common_metadata.schema
> else:
>     self.schema = self.pieces[0].get_metadata(open_file).schema
> {code}
> If these changes are valid, I'd be happy to submit a PR. The difference
> between `_common_metadata` and `_metadata` isn't 100% clear to me, but I
> believe the schema in both should be the same. Figured I'd open this for
> discussion.
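
For discussion purposes, the fallback order proposed in the issue can be sketched outside of pyarrow. This is only an illustration under stated assumptions: `FakeMetadata`, `resolve_schema`, and `read_first_piece` are hypothetical names that do not exist in pyarrow, and schemas are represented as plain strings rather than real schema objects.

```python
from typing import Callable, List, Optional


class FakeMetadata:
    """Hypothetical stand-in for a file-metadata object exposing `.schema`."""

    def __init__(self, schema: str) -> None:
        self.schema = schema


def resolve_schema(
    explicit_schema: Optional[str],
    metadata: Optional[FakeMetadata],
    common_metadata: Optional[FakeMetadata],
    read_first_piece: Callable[[], FakeMetadata],
) -> str:
    """Return the first available schema, in the precedence order proposed in the issue."""
    if explicit_schema is not None:
        return explicit_schema  # schema explicitly provided
    if metadata is not None:
        return metadata.schema  # taken from `_metadata`
    if common_metadata is not None:
        return common_metadata.schema  # taken from `_common_metadata`
    # Last resort: open the first data file and read the schema from it.
    return read_first_piece().schema


# Record whether the fallback had to open a data file.
calls: List[str] = []


def read_first_piece() -> FakeMetadata:
    calls.append("opened")
    return FakeMetadata("schema-from-first-piece")


# Only `_common_metadata` is present, which is the case the issue describes.
schema = resolve_schema(None, None, FakeMetadata("schema-from-common"), read_first_piece)
print(schema)
```

With this ordering, a dataset that has only `_common_metadata` gets its schema without opening any data file, and the first piece is read only when neither metadata file is available.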