[ 
https://issues.apache.org/jira/browse/ARROW-8652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-8652:
-----------------------------------------
    Labels: dataset  (was: )

> [Python] Test error message when discovering dataset with invalid files
> -----------------------------------------------------------------------
>
>                 Key: ARROW-8652
>                 URL: https://issues.apache.org/jira/browse/ARROW-8652
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Joris Van den Bossche
>            Priority: Minor
>              Labels: dataset
>
> There is a comment in test_parquet.py noting that the Dataset API needs a
> better error message for invalid files:
> https://github.com/apache/arrow/blob/ff92a6886ca77515173a50662a1949a792881222/python/pyarrow/tests/test_parquet.py#L3633-L3648
> However, this already seems to work:
> {code}
> import pathlib
> import tempfile
>
> import pyarrow.dataset as ds
>
> # Create an empty (0-byte) file with a .parquet extension
> tempdir = pathlib.Path(tempfile.mkdtemp())
> with open(str(tempdir / "data.parquet"), 'wb') as f:
>     pass
>
> ds.dataset(str(tempdir / "data.parquet"), format="parquet")
> ...
> OSError: Could not open parquet input source '/tmp/tmp312vtjmw/data.parquet': 
> Invalid: Parquet file size is 0 bytes
> {code}
> So we need to update the test to actually exercise this case instead of
> skipping it.
> The only difference from the Python ParquetDataset implementation is that the
> datasets API raises an OSError rather than an ArrowInvalid error.
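
A rough sketch of what the updated check could look like, given only the behaviour reported above. The helper names (make_empty_parquet, raises_with_message, check_with_pyarrow) are hypothetical and are not part of pyarrow or its test suite. The expected message fragment is copied from the traceback above and may differ between Arrow versions. ValueError is accepted alongside OSError on the assumption that pyarrow.ArrowInvalid subclasses ValueError, so that both implementations could share the same check.

```python
import pathlib
import re
import tempfile


def make_empty_parquet(directory):
    """Create a 0-byte file named data.parquet in `directory`; return its path."""
    path = pathlib.Path(directory) / "data.parquet"
    path.touch()
    return path


def raises_with_message(open_dataset, path, exc_types, pattern):
    """Return True if open_dataset(str(path)) raises one of `exc_types`
    and the error message matches the regular expression `pattern`.

    Returns False if nothing is raised or the message does not match.
    Exceptions of other types propagate to the caller.
    """
    try:
        open_dataset(str(path))
    except exc_types as exc:
        return re.search(pattern, str(exc)) is not None
    return False


def check_with_pyarrow():
    """Run the check against pyarrow.dataset (requires pyarrow to be installed)."""
    import pyarrow.dataset as ds

    with tempfile.TemporaryDirectory() as tmp:
        path = make_empty_parquet(tmp)
        # OSError is what the datasets API raised in the report above;
        # ValueError is included as an assumption, to also cover ArrowInvalid.
        return raises_with_message(
            lambda p: ds.dataset(p, format="parquet"),
            path,
            (OSError, ValueError),
            r"Parquet file size is 0 bytes",
        )
```

Calling check_with_pyarrow() in an environment with pyarrow installed should return True if the behaviour matches the traceback above. In the actual test suite the same idea would more likely be written with pytest.raises and a match pattern.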



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
