[jira] [Commented] (ARROW-5310) [Python] better error message on creating ParquetDataset from empty directory

Joris Van den Bossche (Jira) Tue, 17 Mar 2020 08:08:10 -0700


    [ https://issues.apache.org/jira/browse/ARROW-5310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17060984#comment-17060984 ]


Joris Van den Bossche commented on ARROW-5310:
----------------------------------------------

This works now with the new datasets API:

{code}
In [21]: dataset = ds.dataset("notebooks-arrow/test_empty_dir/")

In [22]: dataset.schema
Out[22]:

In [23]: dataset.to_table().to_pandas()
Out[23]:
Empty DataFrame
Columns: []
Index: []
{code}

So once pyarrow.parquet uses the datasets API under the hood (ARROW-8039),
this issue should be solved. We might want to add a test for it.

> [Python] better error message on creating ParquetDataset from empty directory
> -----------------------------------------------------------------------------
>
>                 Key: ARROW-5310
>                 URL: https://issues.apache.org/jira/browse/ARROW-5310
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Joris Van den Bossche
>            Priority: Minor
>              Labels: dataset, dataset-parquet-read, parquet
>
> Currently, when {{path}} is an existing but empty directory, you get:
> {code:python}
> >>> dataset = pq.ParquetDataset(path)
> ---------------------------------------------------------------------------
> IndexError                                Traceback (most recent call last)
> <ipython-input-16-346f72ae525e> in <module>
> ----> 1 dataset = pq.ParquetDataset(path)
> ~/scipy/repos/arrow/python/pyarrow/parquet.py in __init__(self, path_or_paths, filesystem, schema, metadata, split_row_groups, validate_schema, filters, metadata_nthreads, memory_map)
>     989 
>     990         if validate_schema:
> --> 991             self.validate_schemas()
>     992 
>     993         if filters is not None:
> ~/scipy/repos/arrow/python/pyarrow/parquet.py in validate_schemas(self)
>    1025                 self.schema = self.common_metadata.schema
>    1026             else:
> -> 1027                 self.schema = self.pieces[0].get_metadata().schema
>    1028         elif self.schema is None:
>    1029             self.schema = self.metadata.schema
> IndexError: list index out of range
> {code}
> That could be a nicer error message.
> Unless we actually want to allow this? (Although I am not sure there are good
> use cases for supporting empty directories, since we cannot get any schema
> or metadata information from an empty directory.)
> It only fails when validating the schemas, so with
> {{validate_schema=False}} it actually returns a ParquetDataset object, just
> with an empty list for {{pieces}} and no schema. So it would also be easy to
> not raise an error during schema validation for this empty-directory case.
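
For illustration, the "nicer error message" option from the description could look roughly like the sketch below. It is standalone Python and not the actual pyarrow code: the names EmptyDatasetError and first_piece_or_raise are hypothetical, and the pieces argument only stands in for the {{pieces}} list mentioned above.

```python
class EmptyDatasetError(ValueError):
    """Hypothetical error type for a dataset that contains no data files."""


def first_piece_or_raise(pieces, path):
    """Return the first piece, or raise a descriptive error if there is none.

    `pieces` stands in for the list of dataset pieces; `path` is only used
    to build the error message. Both names are assumptions for this sketch.
    """
    if not pieces:
        # Replaces the bare "IndexError: list index out of range" with a
        # message that names the directory and the actual problem.
        raise EmptyDatasetError(
            f"No Parquet files found in {path!r}; cannot infer a schema "
            "from an empty directory"
        )
    return pieces[0]
```

Subclassing ValueError is just one possible choice; the point is only that the guard runs before indexing into the list, so the caller sees the directory name instead of an index error.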



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
