[
https://issues.apache.org/jira/browse/ARROW-8987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
German I. Ramirez-Espinoza updated ARROW-8987:
----------------------------------------------
Description:
Reading functions such as {{dataset.dataset}} and the {{read_table}} / {{read_csv}}
functions in the feather, parquet, and csv modules raise different exceptions
when reading an "empty file" or a "missing file". See table below.
It would be ideal if all the reading functions raised {{FileNotFoundError}}
when the file is missing and {{ArrowInvalid}} when the file is empty.
Most interesting is the case of {{dataset.dataset}}, since the {{format}}
parameter changes which exception is raised for an empty file:
{{format="parquet"}} gives {{OSError}}, while "feather" and "csv" give
{{ArrowInvalid}}.
||Function||Missing file||Empty File||
|feather.read_table|FileNotFoundError|ArrowInvalid|
|parquet.read_table|OSError|ArrowInvalid|
|csv.read_csv|FileNotFoundError|ArrowInvalid|
|dataset.dataset "feather"|FileNotFoundError|ArrowInvalid|
|dataset.dataset "parquet"|FileNotFoundError|OSError|
|dataset.dataset "csv"|FileNotFoundError|ArrowInvalid|
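Until the readers agree, the behaviour requested above (FileNotFoundError for a missing file, ArrowInvalid for an empty one) could be approximated on the caller side with a pre-check. The sketch below is only an illustration, not how Arrow would fix it: {{ensure_readable}} and its {{empty_error}} parameter are hypothetical names, and the default of {{ValueError}} is an assumption chosen so the snippet runs without pyarrow installed.

```python
import os
from pathlib import Path
from typing import Type, Union


def ensure_readable(
    path: Union[str, "os.PathLike[str]"],
    empty_error: Type[Exception] = ValueError,
) -> Path:
    """Hypothetical pre-check to run before handing a path to a reader.

    Raises FileNotFoundError if the path does not exist and `empty_error`
    if it is a zero-byte regular file. Directories are passed through
    unchanged, since dataset-style readers may accept them.
    """
    p = Path(path)
    if not p.exists():
        raise FileNotFoundError(f"No such file: {p}")
    if p.is_file() and p.stat().st_size == 0:
        raise empty_error(f"File is empty: {p}")
    return p
```

With pyarrow available, a caller could pass {{empty_error=pa.ArrowInvalid}} and then call, for example, {{parquet.read_table(str(ensure_readable(path, pa.ArrowInvalid)))}}, so that both failure modes surface with the same exception types regardless of the format.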
Code to reproduce issue:
{code:python}
import pathlib
import sys
import tempfile
import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.dataset as dataset
import pyarrow.feather as feather
import pyarrow.parquet as parquet
tempdir = pathlib.Path(tempfile.mkdtemp())
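# Create three zero-byte files, one per format.
with open(str(tempdir / "empty_feather.feather"), 'wb') as f:
    pass
with open(str(tempdir / "empty_parquet.parquet"), 'wb') as f:
    pass
with open(str(tempdir / "empty_csv.csv"), 'wb') as f:
    pass
# Each call from here on raises, so run them one at a time (for example in a REPL).
# Empty File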
feather.read_table(str(tempdir / "empty_feather.feather"))
parquet.read_table(str(tempdir / "empty_parquet.parquet"))
csv.read_csv(str(tempdir / "empty_csv.csv"))
dataset.dataset(str(tempdir / "empty_feather.feather"), format="feather")
dataset.dataset(str(tempdir / "empty_parquet.parquet"), format="parquet")
dataset.dataset(str(tempdir / "empty_csv.csv"), format="csv")
# Missing File
feather.read_table(str(tempdir / "non_existent.feather"))
parquet.read_table(str(tempdir / "non_existent.parquet"))
csv.read_csv(str(tempdir / "non_existent.csv"))
dataset.dataset(str(tempdir / "non_existent.feather"), format="feather")
dataset.dataset(str(tempdir / "non_existent.parquet"), format="parquet")
dataset.dataset(str(tempdir / "non_existent.csv"), format="csv")
{code}
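Because every read call in the script above raises, running it top to bottom stops at the first failing call. A small harness along the following lines could run each call in isolation and record the exception type, which is one way to rebuild the table. This is a sketch using only the standard library; {{exception_name}} and {{tabulate_exceptions}} are hypothetical helper names, not part of pyarrow.

```python
from typing import Callable, Dict


def exception_name(call: Callable[[], object]) -> str:
    """Run call() and return the name of the exception type it raises.

    Returns "no exception" if the call completes normally.
    """
    try:
        call()
    except Exception as exc:  # broad on purpose: we only want the type name
        return type(exc).__name__
    return "no exception"


def tabulate_exceptions(cases: Dict[str, Callable[[], object]]) -> Dict[str, str]:
    """Map each label to the exception name produced by its zero-argument callable."""
    return {label: exception_name(call) for label, call in cases.items()}
```

Each reader call would be wrapped in a lambda so that it is only evaluated inside the harness, for example {{tabulate_exceptions({"parquet.read_table": lambda: parquet.read_table(str(tempdir / "non_existent.parquet"))})}}.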
was:
Reading functions like {{dataset.dataset}} and {{read_table}} functions in
feather, parquet, and csv modules return different exceptions when reading an
"empty file" or "missing file", respectively. See table below.
Most interesting is the case of {{dataset.dataset}} since the {{format}}
parameter modifies the exception behaviour when reading an empty file.
||Function||Missing file||Empty File||
|feather.read_table|FileNotFoundError|ArrowInvalid|
|parquet.read_table|OSError|ArrowInvalid|
|csv.read_csv|FileNotFoundError|ArrowInvalid|
|dataset.dataset "feather"|FileNotFoundError|ArrowInvalid|
|dataset.dataset "parquet"|FileNotFoundError|OSError|
|dataset.dataset "csv"|FileNotFoundError|ArrowInvalid|
Code to reproduce issue:
{code:python}
import pathlib
import sys
import tempfile
import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.dataset as dataset
import pyarrow.feather as feather
import pyarrow.parquet as parquet
tempdir = pathlib.Path(tempfile.mkdtemp())
with open(str(tempdir / "empty_feather.feather"), 'wb') as f:
    pass
with open(str(tempdir / "empty_parquet.parquet"), 'wb') as f:
    pass
with open(str(tempdir / "empty_csv.csv"), 'wb') as f:
    pass
# Empty File
feather.read_table(str(tempdir / "empty_feather.feather"))
parquet.read_table(str(tempdir / "empty_parquet.parquet"))
csv.read_csv(str(tempdir / "empty_csv.csv"))
dataset.dataset(str(tempdir / "empty_feather.feather"), format="feather")
dataset.dataset(str(tempdir / "empty_parquet.parquet"), format="parquet")
dataset.dataset(str(tempdir / "empty_csv.csv"), format="csv")
# Missing File
feather.read_table(str(tempdir / "non_existent.feather"))
parquet.read_table(str(tempdir / "non_existent.parquet"))
csv.read_csv(str(tempdir / "non_existent.csv"))
dataset.dataset(str(tempdir / "non_existent.feather"), format="feather")
dataset.dataset(str(tempdir / "non_existent.parquet"), format="parquet")
dataset.dataset(str(tempdir / "non_existent.csv"), format="csv")
{code}
> [C++][Python] Make reading functions to return consistent exceptions
> --------------------------------------------------------------------
>
> Key: ARROW-8987
> URL: https://issues.apache.org/jira/browse/ARROW-8987
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 0.17.1
> Reporter: German I. Ramirez-Espinoza
> Priority: Minor
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)