[ https://issues.apache.org/jira/browse/ARROW-8987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
German I. Ramirez-Espinoza updated ARROW-8987:
----------------------------------------------
    Description: 
Reading functions such as {{dataset.dataset}} and the {{read_table}} functions in the feather, parquet, and csv modules raise different exceptions when reading an "empty file" or a "missing file". See the table below.

It would be ideal if all the reading functions raised {{FileNotFoundError}} when the file is missing and {{ArrowInvalid}} when the file is empty.

Most interesting is the case of {{dataset.dataset}}, since the {{format}} parameter changes which exception is raised when reading an empty file.

||Function||Missing file||Empty File||
|feather.read_table|FileNotFoundError|ArrowInvalid|
|parquet.read_table|OSError|ArrowInvalid|
|csv.read_csv|FileNotFoundError|ArrowInvalid|
|dataset.dataset "feather"|FileNotFoundError|ArrowInvalid|
|dataset.dataset "parquet"|FileNotFoundError|OSError|
|dataset.dataset "csv"|FileNotFoundError|ArrowInvalid|

Code to reproduce issue:
{code:python}
import pathlib
import sys
import tempfile

import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.dataset as dataset
import pyarrow.feather as feather
import pyarrow.parquet as parquet

tempdir = pathlib.Path(tempfile.mkdtemp())

with open(str(tempdir / "empty_feather.feather"), 'wb') as f:
    pass
with open(str(tempdir / "empty_parquet.parquet"), 'wb') as f:
    pass
with open(str(tempdir / "empty_csv.csv"), 'wb') as f:
    pass

# Empty File
feather.read_table(str(tempdir / "empty_feather.feather"))
parquet.read_table(str(tempdir / "empty_parquet.parquet"))
csv.read_csv(str(tempdir / "empty_csv.csv"))
dataset.dataset(str(tempdir / "empty_feather.feather"), format="feather")
dataset.dataset(str(tempdir / "empty_parquet.parquet"), format="parquet")
dataset.dataset(str(tempdir / "empty_csv.csv"), format="csv")

# Missing File
feather.read_table(str(tempdir / "non_existent.feather"))
parquet.read_table(str(tempdir / "non_existent.parquet"))
csv.read_csv(str(tempdir / "non_existent.csv"))
dataset.dataset(str(tempdir / "non_existent.feather"), format="feather")
dataset.dataset(str(tempdir / "non_existent.parquet"), format="parquet")
dataset.dataset(str(tempdir / "non_existent.csv"), format="csv")
{code}

  was:
Reading functions like {{dataset.dataset}} and {{read_table}} functions in feather, parquet, and csv modules return different exceptions when reading an "empty file" or "missing file", respectively. See table below.

Most interesting is the case of {{dataset.dataset }}since the {{format}} parameter modifies the exception behaviour when reading an empty file.

||Function||Missing file||Empty File||
|feather.read_table|FileNotFoundError|ArrowInvalid|
|parquet.read_table|OSError|ArrowInvalid|
|csv.read_csv|FileNotFoundError|ArrowInvalid|
|dataset.dataset "feather"|FileNotFoundError|ArrowInvalid|
|dataset.dataset "parquet"|FileNotFoundError|OSError|
|dataset.dataset "csv"|FileNotFoundError|ArrowInvalid|

Code to reproduce issue:
{code:python}
import pathlib
import sys
import tempfile

import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.dataset as dataset
import pyarrow.feather as feather
import pyarrow.parquet as parquet

tempdir = pathlib.Path(tempfile.mkdtemp())

with open(str(tempdir / "empty_feather.feather"), 'wb') as f:
    pass
with open(str(tempdir / "empty_parquet.parquet"), 'wb') as f:
    pass
with open(str(tempdir / "empty_csv.csv"), 'wb') as f:
    pass

# Empty File
feather.read_table(str(tempdir / "empty_feather.feather"))
parquet.read_table(str(tempdir / "empty_parquet.parquet"))
csv.read_csv(str(tempdir / "empty_csv.csv"))
dataset.dataset(str(tempdir / "empty_feather.feather"), format="feather")
dataset.dataset(str(tempdir / "empty_parquet.parquet"), format="parquet")
dataset.dataset(str(tempdir / "empty_csv.csv"), format="csv")

# Missing File
feather.read_table(str(tempdir / "non_existent.feather"))
parquet.read_table(str(tempdir / "non_existent.parquet"))
csv.read_csv(str(tempdir / "non_existent.csv"))
dataset.dataset(str(tempdir / "non_existent.feather"), format="feather")
dataset.dataset(str(tempdir / "non_existent.parquet"), format="parquet")
dataset.dataset(str(tempdir / "non_existent.csv"), format="csv")
{code}


> [C++][Python] Make reading functions to return consistent exceptions
> --------------------------------------------------------------------
>
>                 Key: ARROW-8987
>                 URL: https://issues.apache.org/jira/browse/ARROW-8987
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.17.1
>            Reporter: German I. Ramirez-Espinoza
>            Priority: Minor
>
> Reading functions such as {{dataset.dataset}} and the {{read_table}} functions in
> the feather, parquet, and csv modules raise different exceptions when reading an
> "empty file" or a "missing file". See the table below.
> It would be ideal if all the reading functions raised {{FileNotFoundError}}
> when the file is missing and {{ArrowInvalid}} when the file is empty.
> Most interesting is the case of {{dataset.dataset}}, since the {{format}}
> parameter changes which exception is raised when reading an empty file.
>
> ||Function||Missing file||Empty File||
> |feather.read_table|FileNotFoundError|ArrowInvalid|
> |parquet.read_table|OSError|ArrowInvalid|
> |csv.read_csv|FileNotFoundError|ArrowInvalid|
> |dataset.dataset "feather"|FileNotFoundError|ArrowInvalid|
> |dataset.dataset "parquet"|FileNotFoundError|OSError|
> |dataset.dataset "csv"|FileNotFoundError|ArrowInvalid|
>
> Code to reproduce issue:
> {code:python}
> import pathlib
> import sys
> import tempfile
> import pyarrow as pa
> import pyarrow.csv as csv
> import pyarrow.dataset as dataset
> import pyarrow.feather as feather
> import pyarrow.parquet as parquet
> tempdir = pathlib.Path(tempfile.mkdtemp())
> with open(str(tempdir / "empty_feather.feather"), 'wb') as f:
>     pass
> with open(str(tempdir / "empty_parquet.parquet"), 'wb') as f:
>     pass
> with open(str(tempdir / "empty_csv.csv"), 'wb') as f:
>     pass
> # Empty File
> feather.read_table(str(tempdir / "empty_feather.feather"))
> parquet.read_table(str(tempdir / "empty_parquet.parquet"))
> csv.read_csv(str(tempdir / "empty_csv.csv"))
> dataset.dataset(str(tempdir / "empty_feather.feather"), format="feather")
> dataset.dataset(str(tempdir / "empty_parquet.parquet"), format="parquet")
> dataset.dataset(str(tempdir / "empty_csv.csv"), format="csv")
> # Missing File
> feather.read_table(str(tempdir / "non_existent.feather"))
> parquet.read_table(str(tempdir / "non_existent.parquet"))
> csv.read_csv(str(tempdir / "non_existent.csv"))
> dataset.dataset(str(tempdir / "non_existent.feather"), format="feather")
> dataset.dataset(str(tempdir / "non_existent.parquet"), format="parquet")
> dataset.dataset(str(tempdir / "non_existent.csv"), format="csv")
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
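
Because each call in the reproduction script raises, the script as posted stops at the first failing reader. One possible way to collect the whole table in a single run is sketched below. The helper names {{exception_name}}, {{survey}}, and {{read_bytes}} are hypothetical and not part of pyarrow; {{read_bytes}} is only a standard-library stand-in so the sketch runs without pyarrow installed. The pyarrow readers are the ones named in the issue, and the exception types they report may differ between versions.

```python
import pathlib
import tempfile


def exception_name(reader, path, **kwargs):
    """Call reader(path, **kwargs); return the raised exception's class name, or None."""
    try:
        reader(path, **kwargs)
    except Exception as exc:
        return type(exc).__name__
    return None


def survey(readers, missing_path, empty_path):
    """Map each label to (exception for a missing file, exception for an empty file)."""
    return {
        label: (
            exception_name(reader, missing_path, **kwargs),
            exception_name(reader, empty_path, **kwargs),
        )
        for label, (reader, kwargs) in readers.items()
    }


def read_bytes(path):
    """Hypothetical stand-in reader that uses only the standard library."""
    return pathlib.Path(path).read_bytes()


tempdir = pathlib.Path(tempfile.mkdtemp())
empty = tempdir / "empty.bin"
empty.touch()                           # zero-byte file
missing = tempdir / "non_existent.bin"  # never created

readers = {"read_bytes (stand-in)": (read_bytes, {})}

try:
    # Optional: add the pyarrow readers from the issue when pyarrow is available.
    import pyarrow.csv as csv
    import pyarrow.dataset as dataset
    import pyarrow.feather as feather
    import pyarrow.parquet as parquet

    readers.update({
        "feather.read_table": (feather.read_table, {}),
        "parquet.read_table": (parquet.read_table, {}),
        "csv.read_csv": (csv.read_csv, {}),
        'dataset.dataset "feather"': (dataset.dataset, {"format": "feather"}),
        'dataset.dataset "parquet"': (dataset.dataset, {"format": "parquet"}),
        'dataset.dataset "csv"': (dataset.dataset, {"format": "csv"}),
    })
except ImportError:
    pass  # the stand-in reader alone still demonstrates the approach

result = survey(readers, str(missing), str(empty))

# Print rows in the same Jira table markup used in the issue.
print("||Function||Missing file||Empty File||")
for label, (missing_exc, empty_exc) in result.items():
    print(f"|{label}|{missing_exc}|{empty_exc}|")
```

This sketch uses one zero-byte file for every reader rather than one file per extension, which assumes the readers do not dispatch on the file extension. A cell of {{None}} means the call completed without raising.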