[
https://issues.apache.org/jira/browse/ARROW-8652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joris Van den Bossche updated ARROW-8652:
-----------------------------------------
Labels: dataset (was: )
> [Python] Test error message when discovering dataset with invalid files
> -----------------------------------------------------------------------
>
> Key: ARROW-8652
> URL: https://issues.apache.org/jira/browse/ARROW-8652
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Joris Van den Bossche
> Priority: Minor
> Labels: dataset
>
> There is a comment in test_parquet.py about the Dataset API needing a
> better error message for invalid files:
> https://github.com/apache/arrow/blob/ff92a6886ca77515173a50662a1949a792881222/python/pyarrow/tests/test_parquet.py#L3633-L3648
> However, this seems to work now:
> {code}
> import tempfile
> import pathlib
> import pyarrow.dataset as ds
>
>
> tempdir = pathlib.Path(tempfile.mkdtemp())
> # Create an empty (0-byte) file with a .parquet extension
> with open(str(tempdir / "data.parquet"), 'wb') as f:
>     pass
>
> ds.dataset(str(tempdir / "data.parquet"), format="parquet")
>
>
> ...
> OSError: Could not open parquet input source '/tmp/tmp312vtjmw/data.parquet':
> Invalid: Parquet file size is 0 bytes
> {code}
> So we need to update the test to actually test it instead of skipping.
> The only difference from the Python ParquetDataset implementation is that the
> datasets API raises an OSError rather than an ArrowInvalid error.
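
A minimal sketch of what the updated check could look like, assuming pyarrow is installed. The helper names `make_empty_parquet` and `discovery_error` are hypothetical and not part of pyarrow or of test_parquet.py. The issue reports an OSError from the datasets API; the sketch also catches ArrowInvalid, on the assumption that the exception type may vary between code paths or versions.

```python
import pathlib
import tempfile


def make_empty_parquet(directory):
    """Create a zero-byte file named data.parquet (hypothetical helper)."""
    path = pathlib.Path(directory) / "data.parquet"
    path.write_bytes(b"")
    return path


def discovery_error(path):
    """Return the exception raised while discovering `path`, or None."""
    # Imported lazily so the file helper above works without pyarrow.
    import pyarrow as pa
    import pyarrow.dataset as ds

    try:
        ds.dataset(str(path), format="parquet")
    except (OSError, pa.ArrowInvalid) as exc:
        # The issue reports OSError here; ArrowInvalid is caught as an
        # assumption, since ParquetDataset is said to raise that type.
        return exc
    return None


if __name__ == "__main__":
    err = discovery_error(make_empty_parquet(tempfile.mkdtemp()))
    print(type(err).__name__, err)
```

A real test would then assert on the exception type and on a stable part of the message (for example the file path) instead of skipping.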