[ https://issues.apache.org/jira/browse/ARROW-4076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Francois Saint-Jacques updated ARROW-4076:
------------------------------------------
    Labels: dataset datasets easyfix parquet pull-request-available  (was: datasets easyfix parquet pull-request-available)

> [Python] schema validation and filters
> --------------------------------------
>
>                 Key: ARROW-4076
>                 URL: https://issues.apache.org/jira/browse/ARROW-4076
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: George Sakkis
>            Assignee: Joris Van den Bossche
>            Priority: Minor
>              Labels: dataset, datasets, easyfix, parquet, pull-request-available
>             Fix For: 0.14.0
>
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Currently [schema validation|https://github.com/apache/arrow/blob/758bd557584107cb336cbc3422744dacd93978af/python/pyarrow/parquet.py#L900] of {{ParquetDataset}} takes place before filtering. This may raise a {{ValueError}} if the schema differs across dataset pieces, even if those pieces would subsequently be filtered out. I think validation should happen after filtering to prevent such spurious errors:
> {noformat}
> --- a/pyarrow/parquet.py
> +++ b/pyarrow/parquet.py
> @@ -878,13 +878,13 @@
>          if split_row_groups:
>              raise NotImplementedError("split_row_groups not yet implemented")
>
> -        if validate_schema:
> -            self.validate_schemas()
> -
>          if filters is not None:
>              filters = _check_filters(filters)
>              self._filter(filters)
>
> +        if validate_schema:
> +            self.validate_schemas()
> +
>      def validate_schemas(self):
>          open_file = self._get_open_file_func()
> {noformat}

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
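The ordering argument in the diff above can be sketched without pyarrow: a toy model in which a dataset applies its filters to the pieces first and only then compares schemas, so a piece with a mismatched schema that the filter excludes can no longer trigger a spurious {{ValueError}}. The names {{Piece}} and {{ToyDataset}}, and the {{(key, op, value)}} filter shape, are hypothetical stand-ins for illustration, not pyarrow API.

```python
from dataclasses import dataclass


@dataclass
class Piece:
    """Toy stand-in for a dataset piece: a partition key plus a schema."""
    partition: dict   # e.g. {"year": 2018}
    schema: tuple     # column names, standing in for a real schema object


class ToyDataset:
    def __init__(self, pieces, filters=None, validate_schema=True):
        self.pieces = list(pieces)
        # Proposed order from ARROW-4076: filter first, so pieces that
        # are dropped can no longer cause a schema-validation error.
        if filters is not None:
            self._filter(filters)
        if validate_schema:
            self.validate_schemas()

    def _filter(self, filters):
        # filters: list of (key, op, value); only "=" is modeled here.
        def keep(piece):
            return all(piece.partition.get(key) == value
                       for key, op, value in filters if op == "=")
        self.pieces = [p for p in self.pieces if keep(p)]

    def validate_schemas(self):
        # Every remaining piece must match the first piece's schema.
        reference = self.pieces[0].schema if self.pieces else None
        for piece in self.pieces:
            if piece.schema != reference:
                raise ValueError(
                    "Schema in partition %r is different: %r vs %r"
                    % (piece.partition, piece.schema, reference))
```

With two partitions of differing schemas, constructing the dataset with a filter that excludes the mismatched partition succeeds, while constructing it unfiltered raises {{ValueError}} from {{validate_schemas}}, mirroring the behavior the patch aims for.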