Joris Van den Bossche created ARROW-8652:
--------------------------------------------

             Summary: [Python] Test error message when discovering dataset with 
invalid files
                 Key: ARROW-8652
                 URL: https://issues.apache.org/jira/browse/ARROW-8652
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
            Reporter: Joris Van den Bossche


There is a comment in test_parquet.py about the Dataset API needing a better 
error message for invalid files:

https://github.com/apache/arrow/blob/ff92a6886ca77515173a50662a1949a792881222/python/pyarrow/tests/test_parquet.py#L3633-L3648

However, this already seems to work now:

{code}
import tempfile
import pathlib
import pyarrow.dataset as ds

tempdir = pathlib.Path(tempfile.mkdtemp())

with open(str(tempdir / "data.parquet"), 'wb') as f:
    pass

In [10]: ds.dataset(str(tempdir / "data.parquet"), format="parquet")
...
OSError: Could not open parquet input source '/tmp/tmp312vtjmw/data.parquet': Invalid: Parquet file size is 0 bytes
{code}

So we need to update the test to actually check this error message instead of 
skipping.

The only difference from the Python ParquetDataset implementation is that the 
datasets API raises an OSError and not an ArrowInvalid error.



