[jira] [Created] (ARROW-2079) Possibly use `_common_metadata` for schema if `_metadata` isn't available

Jim Crist (JIRA) Thu, 01 Feb 2018 13:56:50 -0800

Jim Crist created ARROW-2079:
--------------------------------

             Summary: Possibly use `_common_metadata` for schema if `_metadata` 
isn't available
                 Key: ARROW-2079
                 URL: https://issues.apache.org/jira/browse/ARROW-2079
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
            Reporter: Jim Crist



Currently pyarrow's parquet writer only writes `_common_metadata` and not 
`_metadata`. From what I understand, `_common_metadata` is intended to contain 
the dataset schema but not any row group information.

 

A few (possibly naive) questions:

 

1. In the `__init__` for `ParquetDataset`, the following lines exist:
{code:java}
if self.metadata_path is not None:
    with self.fs.open(self.metadata_path) as f:
        self.common_metadata = ParquetFile(f).metadata
else:
    self.common_metadata = None
{code}
I believe this should use `common_metadata_path` instead of `metadata_path`. 
The latter points to the `_metadata` file, which `pyarrow` never writes, rather 
than to `_common_metadata` (as seemingly intended?).

 

2. In `validate_schemas` I believe an option should exist for using the schema 
from `_common_metadata` instead of `_metadata`, as pyarrow currently only 
writes the former, and as far as I can tell `_common_metadata` does include all 
the schema information needed.

 

Perhaps the logic in `validate_schemas` could be ported over to:

 
{code:java}
if self.schema is not None:
    pass  # schema explicitly provided
elif self.metadata is not None:
    self.schema = self.metadata.schema
elif self.common_metadata is not None:
    self.schema = self.common_metadata.schema
else:
    self.schema = self.pieces[0].get_metadata(open_file).schema
{code}
If these changes are valid, I'd be happy to submit a PR. The difference between 
`_common_metadata` and `_metadata` isn't 100% clear to me, but I believe the 
schema in both should be the same. Figured I'd open this for discussion.
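
For discussion purposes, the fallback order proposed above can be tried out 
without pyarrow in a small stand-alone sketch. Note that `resolve_schema` and 
all of its parameters are hypothetical names chosen for illustration, not part 
of pyarrow; the metadata objects are stand-ins that only need a `.schema` 
attribute, and the schemas are plain strings.

```python
from types import SimpleNamespace


def resolve_schema(explicit_schema, metadata, common_metadata, first_piece_schema):
    """Pick a dataset schema using the fallback order proposed in this issue.

    Hypothetical helper, not a pyarrow API. `metadata` and `common_metadata`
    are objects with a `.schema` attribute (or None). `first_piece_schema`
    is a zero-argument callable, so the first data file is only opened when
    nothing cheaper is available.
    """
    if explicit_schema is not None:
        return explicit_schema          # schema explicitly provided
    if metadata is not None:
        return metadata.schema          # taken from `_metadata`
    if common_metadata is not None:
        return common_metadata.schema   # taken from `_common_metadata`
    return first_piece_schema()         # last resort: read a data file


# The case described in this issue: only `_common_metadata` is present.
common = SimpleNamespace(schema="schema-from-_common_metadata")
chosen = resolve_schema(None, None, common, lambda: "schema-from-first-piece")
print(chosen)
```

Passing the last fallback as a callable is only a design choice of this 
sketch: it keeps the (potentially expensive) read of the first piece lazy.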



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
