[ 
https://issues.apache.org/jira/browse/ARROW-8987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

German I. Ramirez-Espinoza updated ARROW-8987:
----------------------------------------------
    Description: 
Reading functions such as {{dataset.dataset}} and the {{read_table}} / {{read_csv}} 
functions in the feather, parquet, and csv modules raise different exceptions 
when reading an "empty file" or a "missing file". See the table below.

It would be ideal if all the reading functions raised {{FileNotFoundError}} 
when the file is missing and {{ArrowInvalid}} when the file is empty.
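
As a rough illustration of the behaviour requested above, a caller-side wrapper could check the path before delegating to any reader. This is only a sketch: {{read_with_consistent_errors}} and its {{invalid_exc}} parameter are hypothetical names, not part of pyarrow, and the pre-check is a stand-in for a fix that would really belong in the library.

```python
import errno
import os


def read_with_consistent_errors(reader, path, invalid_exc=ValueError):
    """Call reader(path) after checking that path exists and is non-empty.

    Hypothetical helper: `reader` is any one-argument reading function and
    `invalid_exc` is the exception class to raise for an empty file.
    """
    path = os.fspath(path)
    if not os.path.exists(path):
        # Missing file: always FileNotFoundError, whatever the reader is.
        raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), path)
    if os.path.getsize(path) == 0:
        # Zero-byte file: always the same "invalid" exception.
        raise invalid_exc(f"File is empty: {path}")
    return reader(path)
```

With pyarrow one would presumably pass a reader such as {{parquet.read_table}} and {{invalid_exc=pyarrow.ArrowInvalid}} (a {{ValueError}} subclass). The check is racy, since the file can change between the check and the read, so it only approximates what a library-level fix would give.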

Most interesting is the case of {{dataset.dataset}}, since the {{format}} 
parameter changes which exception is raised when reading an empty file.

 
||Function||Missing file||Empty file||
|feather.read_table|FileNotFoundError|ArrowInvalid|
|parquet.read_table|OSError|ArrowInvalid|
|csv.read_csv|FileNotFoundError|ArrowInvalid|
|dataset.dataset "feather"|FileNotFoundError|ArrowInvalid|
|dataset.dataset "parquet"|FileNotFoundError|OSError|
|dataset.dataset "csv"|FileNotFoundError|ArrowInvalid|

 

Code to reproduce issue:
{code:python}
import pathlib
import tempfile

import pyarrow.csv as csv
import pyarrow.dataset as dataset
import pyarrow.feather as feather
import pyarrow.parquet as parquet

tempdir = pathlib.Path(tempfile.mkdtemp())

with open(str(tempdir / "empty_feather.feather"), 'wb') as f:
    pass

with open(str(tempdir / "empty_parquet.parquet"), 'wb') as f:
    pass

with open(str(tempdir / "empty_csv.csv"), 'wb') as f:
    pass

# Empty file: each call below raises, so run them one at a time
feather.read_table(str(tempdir / "empty_feather.feather"))
parquet.read_table(str(tempdir / "empty_parquet.parquet"))
csv.read_csv(str(tempdir / "empty_csv.csv"))
dataset.dataset(str(tempdir / "empty_feather.feather"), format="feather")
dataset.dataset(str(tempdir / "empty_parquet.parquet"), format="parquet")
dataset.dataset(str(tempdir / "empty_csv.csv"), format="csv")

# Missing file: each call below raises, so run them one at a time
feather.read_table(str(tempdir / "non_existent.feather"))
parquet.read_table(str(tempdir / "non_existent.parquet"))
csv.read_csv(str(tempdir / "non_existent.csv"))
dataset.dataset(str(tempdir / "non_existent.feather"), format="feather")
dataset.dataset(str(tempdir / "non_existent.parquet"), format="parquet")
dataset.dataset(str(tempdir / "non_existent.csv"), format="csv")

{code}
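
Because every call in the script above raises, a plain run stops at the first one. A small helper along the following lines could collect the exception names for all readers in one pass. This is a sketch under assumptions: {{raised_type}}, {{exception_table}}, and {{pyarrow_cases}} are hypothetical names, and {{pyarrow_cases}} assumes the files are named {{<stem>.feather}}, {{<stem>.parquet}}, and {{<stem>.csv}}, which differs slightly from the file names used above.

```python
import os


def raised_type(func, *args, **kwargs):
    """Return the class name of the exception func raises, or None."""
    try:
        func(*args, **kwargs)
    except Exception as exc:
        return type(exc).__name__
    return None


def exception_table(cases):
    """Map each label to the exception name raised by its zero-argument call."""
    return {label: raised_type(call) for label, call in cases.items()}


def pyarrow_cases(directory, stem):
    """Build the six reader calls from the table for files named <stem>.<ext>."""
    # Imported lazily so the helpers above work without pyarrow installed.
    import pyarrow.csv as csv
    import pyarrow.dataset as dataset
    import pyarrow.feather as feather
    import pyarrow.parquet as parquet

    def path(ext):
        return os.path.join(str(directory), f"{stem}.{ext}")

    return {
        "feather.read_table": lambda: feather.read_table(path("feather")),
        "parquet.read_table": lambda: parquet.read_table(path("parquet")),
        "csv.read_csv": lambda: csv.read_csv(path("csv")),
        'dataset.dataset "feather"': lambda: dataset.dataset(
            path("feather"), format="feather"),
        'dataset.dataset "parquet"': lambda: dataset.dataset(
            path("parquet"), format="parquet"),
        'dataset.dataset "csv"': lambda: dataset.dataset(
            path("csv"), format="csv"),
    }
```

For example, {{exception_table(pyarrow_cases(tempdir, "non_existent"))}} would give the "Missing file" column, and the same call with the stem of three zero-byte files would give the "Empty file" column.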

  was:
Reading functions like {{dataset.dataset}} and {{read_table}} functions in 
feather, parquet, and csv modules return different exceptions when reading an 
"empty file" or "missing file", respectively. See table below.

Most interesting is the case of {{dataset.dataset}} since the {{format}} 
parameter modifies the exception behaviour when reading an empty file.

 
||Function||Missing file||Empty File||
|feather.read_table|FileNotFoundError|ArrowInvalid|
|parquet.read_table|OSError|ArrowInvalid|
|csv.read_csv|FileNotFoundError|ArrowInvalid|
|dataset.dataset "feather"|FileNotFoundError|ArrowInvalid|
|dataset.dataset "parquet"|FileNotFoundError|OSError|
|dataset.dataset "csv"|FileNotFoundError|ArrowInvalid|

 

Code to reproduce issue:
{code:python}
import pathlib
import sys
import tempfile

import pyarrow as pa

import pyarrow.csv as csv
import pyarrow.dataset as dataset
import pyarrow.feather as feather
import pyarrow.parquet as parquet

tempdir = pathlib.Path(tempfile.mkdtemp())

with open(str(tempdir / "empty_feather.feather"), 'wb') as f:
    pass

with open(str(tempdir / "empty_parquet.parquet"), 'wb') as f:
    pass

with open(str(tempdir / "empty_csv.csv"), 'wb') as f:
    pass

# Empty File
feather.read_table(str(tempdir / "empty_feather.feather"))
parquet.read_table(str(tempdir / "empty_parquet.parquet"))
csv.read_csv(str(tempdir / "empty_csv.csv"))
dataset.dataset(str(tempdir / "empty_feather.feather"), format="feather")
dataset.dataset(str(tempdir / "empty_parquet.parquet"), format="parquet")
dataset.dataset(str(tempdir / "empty_csv.csv"), format="csv")

# Missing File
feather.read_table(str(tempdir / "non_existent.feather"))
parquet.read_table(str(tempdir / "non_existent.parquet"))
csv.read_csv(str(tempdir / "non_existent.csv"))
dataset.dataset(str(tempdir / "non_existent.feather"), format="feather")
dataset.dataset(str(tempdir / "non_existent.parquet"), format="parquet")
dataset.dataset(str(tempdir / "non_existent.csv"), format="csv")

{code}


> [C++][Python] Make reading functions to return consistent exceptions
> --------------------------------------------------------------------
>
>                 Key: ARROW-8987
>                 URL: https://issues.apache.org/jira/browse/ARROW-8987
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.17.1
>            Reporter: German I. Ramirez-Espinoza
>            Priority: Minor
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
