[ https://issues.apache.org/jira/browse/ARROW-8987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17123438#comment-17123438 ]

Joris Van den Bossche commented on ARROW-8987:
----------------------------------------------

[~gire] thanks for this overview! 

That's indeed something that ideally would be made consistent. But I think the 
difference between a missing file (FileNotFoundError) and an empty file 
(ArrowInvalid) is fine? (as long as each category is handled consistently 
across the different functions)
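
One detail of Python's built-in exception hierarchy may be relevant to the "missing file" column: FileNotFoundError is a subclass of OSError, so the OSError reported for parquet.read_table is the broader category rather than an unrelated one. A minimal sketch using only the standard library (the helper name `file_error_category` is hypothetical and not part of pyarrow):

```python
import os
import tempfile


def file_error_category(exc):
    """Return the most specific built-in category for a file-access error."""
    if isinstance(exc, FileNotFoundError):
        return "FileNotFoundError"
    if isinstance(exc, OSError):
        return "OSError"
    return type(exc).__name__


# Opening a path that does not exist raises FileNotFoundError,
# which an `except OSError` clause also catches.
missing = os.path.join(tempfile.mkdtemp(), "non_existent.bin")
try:
    open(missing, "rb")
except OSError as exc:
    caught = exc

print(issubclass(FileNotFoundError, OSError))  # True
print(file_error_category(caught))
```

A caller that catches OSError would therefore handle the missing-file case for every function in the table, while a caller that catches only FileNotFoundError would miss a plain OSError.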

> [C++][Python] Make reading functions to return consistent exceptions
> --------------------------------------------------------------------
>
>                 Key: ARROW-8987
>                 URL: https://issues.apache.org/jira/browse/ARROW-8987
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.17.1
>            Reporter: German I. Ramirez-Espinoza
>            Priority: Minor
>
> Reading functions such as {{dataset.dataset}} and the {{read_table}} / {{read_csv}} 
> functions in the feather, parquet, and csv modules raise different exceptions 
> when reading an "empty file" or a "missing file". See the table below.
> Most interesting is the case of {{dataset.dataset}}, since the {{format}} 
> parameter changes which exception is raised when reading an empty file.
>  
> ||Function||Missing file||Empty file||
> |feather.read_table|FileNotFoundError|ArrowInvalid|
> |parquet.read_table|OSError|ArrowInvalid|
> |csv.read_csv|FileNotFoundError|ArrowInvalid|
> |dataset.dataset "feather"|FileNotFoundError|ArrowInvalid|
> |dataset.dataset "parquet"|FileNotFoundError|OSError|
> |dataset.dataset "csv"|FileNotFoundError|ArrowInvalid|
>  
> Code to reproduce issue:
> {code:python}
> import pathlib
> import tempfile
> import pyarrow.csv as csv
> import pyarrow.dataset as dataset
> import pyarrow.feather as feather
> import pyarrow.parquet as parquet
> # Create three zero-byte files in a fresh temporary directory
> tempdir = pathlib.Path(tempfile.mkdtemp())
> with open(str(tempdir / "empty_feather.feather"), 'wb') as f:
>     pass
> with open(str(tempdir / "empty_parquet.parquet"), 'wb') as f:
>     pass
> with open(str(tempdir / "empty_csv.csv"), 'wb') as f:
>     pass
> # Empty file: each call below raises, so run them one at a time
> feather.read_table(str(tempdir / "empty_feather.feather"))
> parquet.read_table(str(tempdir / "empty_parquet.parquet"))
> csv.read_csv(str(tempdir / "empty_csv.csv"))
> dataset.dataset(str(tempdir / "empty_feather.feather"), format="feather")
> dataset.dataset(str(tempdir / "empty_parquet.parquet"), format="parquet")
> dataset.dataset(str(tempdir / "empty_csv.csv"), format="csv")
> # Missing file: each call below raises, so run them one at a time
> feather.read_table(str(tempdir / "non_existent.feather"))
> parquet.read_table(str(tempdir / "non_existent.parquet"))
> csv.read_csv(str(tempdir / "non_existent.csv"))
> dataset.dataset(str(tempdir / "non_existent.feather"), format="feather")
> dataset.dataset(str(tempdir / "non_existent.parquet"), format="parquet")
> dataset.dataset(str(tempdir / "non_existent.csv"), format="csv")
> {code}
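
Because every call in the reproduction script raises, the script stops at the first one, and filling in the table means running the calls one at a time. One way to collect the exception types in a single run is a small wrapper like the sketch below. The helper names `raised_exception_name` and `read_bytes` are hypothetical, and the demo uses the built-in `open` rather than pyarrow so that the block has no third-party dependency.

```python
import pathlib
import tempfile


def raised_exception_name(func, *args, **kwargs):
    """Call func and return the class name of the exception it raises,
    or None if the call succeeds."""
    try:
        func(*args, **kwargs)
    except Exception as exc:
        return type(exc).__name__
    return None


def read_bytes(path):
    """Stand-in reader for the demo: return the raw contents of a file."""
    with open(path, "rb") as f:
        return f.read()


tempdir = pathlib.Path(tempfile.mkdtemp())
empty_path = tempdir / "empty.bin"
empty_path.touch()  # zero-byte file, like the "empty file" cases

results = {
    "missing": raised_exception_name(read_bytes, str(tempdir / "non_existent.bin")),
    "empty": raised_exception_name(read_bytes, str(empty_path)),
}
print(results)
```

With pyarrow installed, the same wrapper could presumably be applied to each call in the script, for example `raised_exception_name(feather.read_table, str(tempdir / "empty_feather.feather"))`, to rebuild the table row by row.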



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
