[ https://issues.apache.org/jira/browse/ARROW-8987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17123438#comment-17123438 ]
Joris Van den Bossche commented on ARROW-8987:
----------------------------------------------

[~gire] thanks for this overview! That's indeed something that ideally would be made consistent.

But I think the difference between missing file (FileNotFoundError) and empty file (ArrowInvalid) is fine? (as long as it is consistent within each category across the different functions)

> [C++][Python] Make reading functions to return consistent exceptions
> --------------------------------------------------------------------
>
>                 Key: ARROW-8987
>                 URL: https://issues.apache.org/jira/browse/ARROW-8987
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.17.1
>            Reporter: German I. Ramirez-Espinoza
>            Priority: Minor
>
> Reading functions like {{dataset.dataset}} and the {{read_table}} / {{read_csv}}
> functions in the feather, parquet, and csv modules raise different exceptions
> when reading an "empty file" or a "missing file". See the table below.
> Most interesting is the case of {{dataset.dataset}}, since the {{format}}
> parameter changes which exception is raised when reading an empty file.
>
> ||Function||Missing file||Empty File||
> |feather.read_table|FileNotFoundError|ArrowInvalid|
> |parquet.read_table|OSError|ArrowInvalid|
> |csv.read_csv|FileNotFoundError|ArrowInvalid|
> |dataset.dataset "feather"|FileNotFoundError|ArrowInvalid|
> |dataset.dataset "parquet"|FileNotFoundError|OSError|
> |dataset.dataset "csv"|FileNotFoundError|ArrowInvalid|
>
> Code to reproduce issue:
> {code:python}
> import pathlib
> import sys
> import tempfile
> import pyarrow as pa
> import pyarrow.csv as csv
> import pyarrow.dataset as dataset
> import pyarrow.feather as feather
> import pyarrow.parquet as parquet
>
> tempdir = pathlib.Path(tempfile.mkdtemp())
> with open(str(tempdir / "empty_feather.feather"), 'wb') as f:
>     pass
> with open(str(tempdir / "empty_parquet.parquet"), 'wb') as f:
>     pass
> with open(str(tempdir / "empty_csv.csv"), 'wb') as f:
>     pass
>
> # Empty File
> feather.read_table(str(tempdir / "empty_feather.feather"))
> parquet.read_table(str(tempdir / "empty_parquet.parquet"))
> csv.read_csv(str(tempdir / "empty_csv.csv"))
> dataset.dataset(str(tempdir / "empty_feather.feather"), format="feather")
> dataset.dataset(str(tempdir / "empty_parquet.parquet"), format="parquet")
> dataset.dataset(str(tempdir / "empty_csv.csv"), format="csv")
>
> # Missing File
> feather.read_table(str(tempdir / "non_existent.feather"))
> parquet.read_table(str(tempdir / "non_existent.parquet"))
> csv.read_csv(str(tempdir / "non_existent.csv"))
> dataset.dataset(str(tempdir / "non_existent.feather"), format="feather")
> dataset.dataset(str(tempdir / "non_existent.parquet"), format="parquet")
> dataset.dataset(str(tempdir / "non_existent.csv"), format="csv")
> {code}
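
Run top to bottom, the reproduction script stops at the first call that raises, so each call has to be tried separately to fill in the table. A minimal sketch of one way to collect the exception type per reader and check consistency within each category follows. The helper names (exception_name, exception_table, is_consistent) are hypothetical and not part of pyarrow, and two standard-library readers stand in for the pyarrow functions so the snippet runs without pyarrow installed.

```python
import json
import pathlib
import tempfile


def exception_name(reader, path):
    """Call reader(path); return the raised exception's class name, or None."""
    try:
        reader(path)
    except Exception as exc:  # broad on purpose: any exception type is recorded
        return type(exc).__name__
    return None


def exception_table(readers, missing_path, empty_path):
    """Map each label to (exception on missing file, exception on empty file)."""
    return {
        label: (exception_name(reader, missing_path),
                exception_name(reader, empty_path))
        for label, reader in readers.items()
    }


def is_consistent(table):
    """True if all readers agree within each category (missing, empty)."""
    missing = {row[0] for row in table.values()}
    empty = {row[1] for row in table.values()}
    return len(missing) <= 1 and len(empty) <= 1


# Stand-in readers (assumption: these replace the pyarrow functions for the demo).
def read_text(path):
    with open(path) as f:
        return f.read()


def read_json(path):
    with open(path) as f:
        return json.load(f)


tempdir = pathlib.Path(tempfile.mkdtemp())
empty = tempdir / "empty_file"
empty.touch()                        # zero-byte file
missing = tempdir / "non_existent"   # never created

demo_readers = {"read_text": read_text, "read_json": read_json}
demo_table = exception_table(demo_readers, str(missing), str(empty))

# Print rows in the same shape as the table in the issue description.
for label, (on_missing, on_empty) in demo_table.items():
    print(f"|{label}|{on_missing}|{on_empty}|")
```

To apply this to the functions from the issue, the readers dict could hold entries such as "parquet.read_table": parquet.read_table and "dataset parquet": lambda p: dataset.dataset(p, format="parquet"). The check treats "missing file" and "empty file" as separate categories, which matches the suggestion that the two may differ as long as each is consistent across functions.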