[ https://issues.apache.org/jira/browse/ARROW-8652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche updated ARROW-8652:
-----------------------------------------
    Labels: dataset  (was: )

> [Python] Test error message when discovering dataset with invalid files
> -----------------------------------------------------------------------
>
>                 Key: ARROW-8652
>                 URL: https://issues.apache.org/jira/browse/ARROW-8652
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Joris Van den Bossche
>            Priority: Minor
>              Labels: dataset
>
> There is a comment in test_parquet.py about the Dataset API needing a
> better error message for invalid files:
> https://github.com/apache/arrow/blob/ff92a6886ca77515173a50662a1949a792881222/python/pyarrow/tests/test_parquet.py#L3633-L3648
> However, this seems to work now:
> {code}
> import tempfile
> import pathlib
> import pyarrow.dataset as ds
>
> tempdir = pathlib.Path(tempfile.mkdtemp())
> with open(str(tempdir / "data.parquet"), 'wb') as f:
>     pass
> In [10]: ds.dataset(str(tempdir / "data.parquet"), format="parquet")
> ...
> OSError: Could not open parquet input source '/tmp/tmp312vtjmw/data.parquet':
> Invalid: Parquet file size is 0 bytes
> {code}
> So we need to update the test to actually exercise this error instead of
> skipping it.
> The only difference from the Python ParquetDataset implementation is that the
> datasets API raises an OSError and not an ArrowInvalid error.