Joris Van den Bossche created ARROW-8652:
--------------------------------------------
Summary: [Python] Test error message when discovering dataset with
invalid files
Key: ARROW-8652
URL: https://issues.apache.org/jira/browse/ARROW-8652
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Joris Van den Bossche
There is a comment in test_parquet.py about the Dataset API needing a better
error message for invalid files:
https://github.com/apache/arrow/blob/ff92a6886ca77515173a50662a1949a792881222/python/pyarrow/tests/test_parquet.py#L3633-L3648
However, this seems to work now:
{code}
import tempfile
import pathlib
import pyarrow.dataset as ds
tempdir = pathlib.Path(tempfile.mkdtemp())
# Create an empty (zero-byte) file, which is not a valid Parquet file
with open(str(tempdir / "data.parquet"), 'wb') as f:
    pass
In [10]: ds.dataset(str(tempdir / "data.parquet"), format="parquet")
...
OSError: Could not open parquet input source '/tmp/tmp312vtjmw/data.parquet':
Invalid: Parquet file size is 0 bytes
{code}
So we need to update the test to actually check the error message instead of
skipping. The only difference from the Python ParquetDataset implementation is
that the datasets API raises an OSError rather than an ArrowInvalid error.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)