[jira] [Commented] (ARROW-2079) [Python][C++] Possibly use `_common_metadata` for schema if `_metadata` isn't available

Joris Van den Bossche (Jira) Thu, 04 Jun 2020 06:28:22 -0700


    [ https://issues.apache.org/jira/browse/ARROW-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125920#comment-17125920 ]


Joris Van den Bossche commented on ARROW-2079:
----------------------------------------------

So in practice, when using the factory to discover a partitioned Parquet 
dataset, the default right now is to "infer" the schema by reading it from the 
first file only. Optionally, you can let it read the schemas of all files and 
build a "unified" schema from those, or pass an explicit schema.
So what are the concrete TODO items that we want to change here, based on the 
discussion in this issue?

If a {{_common_metadata}} file is present, use that to read the schema 
instead of the "first" file (and I suppose that's all we would do with 
this {{_common_metadata}} file?).
And is the advantage then that this {{_common_metadata}} file is smaller 
than the "first" data file?
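
To make the lookup order concrete, here is a minimal sketch of how the file 
used for schema inference could be chosen. It is not the actual factory code: 
the function name {{pick_schema_source}} is hypothetical, and it assumes that 
data files end in {{.parquet}} and that "first" means first in sorted path 
order.

```python
from pathlib import Path
from typing import Optional


def pick_schema_source(dataset_dir: str) -> Optional[Path]:
    """Return the file whose footer would be read to obtain the schema.

    Hypothetical helper for illustration; not part of pyarrow.
    """
    root = Path(dataset_dir)

    # Prefer the metadata sidecar files, in the order proposed in the issue:
    # `_metadata` first, then `_common_metadata`.
    for name in ("_metadata", "_common_metadata"):
        candidate = root / name
        if candidate.is_file():
            return candidate

    # Otherwise fall back to the "first" data file.
    # Assumption: data files end in ".parquet" and "first" means first in
    # sorted path order, searching partition directories recursively.
    data_files = sorted(p for p in root.rglob("*.parquet") if p.is_file())
    return data_files[0] if data_files else None
```

The returned path could then be handed to something like 
{{pyarrow.parquet.read_schema}} to get the schema itself. An explicitly 
passed schema would bypass this lookup entirely.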

> [Python][C++] Possibly use `_common_metadata` for schema if `_metadata` isn't 
> available
> ---------------------------------------------------------------------------------------
>
>                 Key: ARROW-2079
>                 URL: https://issues.apache.org/jira/browse/ARROW-2079
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Jim Crist
>            Priority: Minor
>              Labels: dataset, dataset-parquet-read, parquet
>
> Currently pyarrow's parquet writer only writes `_common_metadata` and not 
> `_metadata`. From what I understand these are intended to contain the dataset 
> schema but not any row group information.
>  
> A few (possibly naive) questions:
>  
> 1. In the `__init__` for `ParquetDataset`, the following lines exist:
> {code:python}
> if self.metadata_path is not None:
>     with self.fs.open(self.metadata_path) as f:
>         self.common_metadata = ParquetFile(f).metadata
> else:
>     self.common_metadata = None
> {code}
> I believe this should use `common_metadata_path` instead of `metadata_path`: 
> the latter points to the `_metadata` file, which `pyarrow` never writes, 
> rather than to `_common_metadata` (as seemingly intended?).
>  
> 2. In `validate_schemas` I believe an option should exist for using the 
> schema from `_common_metadata` instead of `_metadata`, as pyarrow currently 
> only writes the former, and as far as I can tell `_common_metadata` does 
> include all the schema information needed.
>  
> Perhaps the logic in `validate_schemas` could be ported over to:
>  
> {code:python}
> if self.schema is not None:
>     pass  # schema explicitly provided
> elif self.metadata is not None:
>     self.schema = self.metadata.schema
> elif self.common_metadata is not None:
>     self.schema = self.common_metadata.schema
> else:
>     self.schema = self.pieces[0].get_metadata(open_file).schema
> {code}
> If these changes are valid, I'd be happy to submit a PR. The difference 
> between `_common_metadata` and `_metadata` is not 100% clear to me, but I 
> believe the schema in both should be the same. Figured I'd open this for 
> discussion.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
