[jira] [Updated] (ARROW-2444) Better handle reading empty parquet files

Jim Crist (JIRA) Tue, 10 Apr 2018 08:38:21 -0700

     [ https://issues.apache.org/jira/browse/ARROW-2444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Jim Crist updated ARROW-2444:
-----------------------------
    Description: 
From [https://github.com/dask/dask/pull/3387#issuecomment-380140003]

 

Currently pyarrow reads empty parts as float64, even if the underlying columns 
have other dtypes. This can cause problems for pandas downstream, as certain 
operations are only valid on certain dtypes, even if the columns are empty.
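As a minimal sketch of why the dtype matters even with zero rows, the snippet below compares an empty object column with an empty float64 column in plain pandas. The helper names (`has_str_accessor`, `numeric_columns`) and the column labels are made up for illustration; no pyarrow call is involved, the two Series merely stand in for "what was written" and "what is read back".

```python
import pandas as pd

# Two empty columns that differ only in dtype.
as_written = pd.Series([], dtype=object)       # e.g. an empty string column
as_read_back = pd.Series([], dtype="float64")  # the dtype an empty part comes back as


def has_str_accessor(series):
    """Return True if the .str accessor can be used on this Series."""
    try:
        series.str
    except AttributeError:
        # pandas refuses .str on non-string dtypes such as float64.
        return False
    return True


def numeric_columns(frame):
    """Names of the columns pandas treats as numeric."""
    return list(frame.select_dtypes(include="number").columns)


frame = pd.DataFrame({"written": as_written, "read_back": as_read_back})
print(has_str_accessor(as_written), has_str_accessor(as_read_back))
print(numeric_columns(frame))
```

Code that calls `.str` methods, or that selects numeric columns, would therefore behave differently on an empty part than on a non-empty part of the same dataset.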

 

Copying Uwe's comment over:

 
{quote}This is the expected behaviour as an empty string column in Pandas 
is simply an empty column of type object. Sadly object does not tell us much 
about the type of the column at all. We return numpy.float64 in this case as 
it's the most efficient type to store nulls in Pandas.{quote}

{quote}This seems unintuitive at best to me. An empty object column in pandas 
is treated differently in many operations than an empty float64 column (str 
accessor is available, excluded from numeric operations, etc.). Having an 
empty file read in as a different dtype than was written could lead to errors 
in processing code downstream. Would arrow be willing to change this 
behavior?{quote}

We should probably use a method other than `field.type.to_pandas_dtype()` in 
this case. The column is presumably saved in Parquet with `NA` as its type, 
which sadly does not provide enough information.

We also store the original dtype in the Pandas metadata that is used for the 
actual DataFrame reconstruction later on. If we also picked up this metadata 
whenever it was written, we should be able to reconstruct the dtype correctly.
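A rough sketch of the metadata idea, using only the standard library: the pandas metadata is a JSON document stored under the `pandas` key of the schema metadata, with one entry per column that records `numpy_type`. The functions `original_dtypes` and `dtype_for_empty_column` and the sample blob below are hypothetical illustrations, not pyarrow API and not the fix that was eventually implemented.

```python
import json

# Abridged, made-up example of the JSON stored under the b"pandas" key
# of the schema metadata when a DataFrame is written.
EXAMPLE_PANDAS_METADATA = json.dumps({
    "columns": [
        {"name": "name", "pandas_type": "unicode", "numpy_type": "object"},
        {"name": "count", "pandas_type": "int64", "numpy_type": "int64"},
    ]
}).encode("utf8")


def original_dtypes(schema_metadata):
    """Map column name -> numpy dtype string recorded at write time.

    `schema_metadata` is a dict with bytes keys and values. An empty dict
    is returned when no pandas metadata was written.
    """
    raw = schema_metadata.get(b"pandas")
    if raw is None:
        return {}
    meta = json.loads(raw.decode("utf8"))
    return {col["name"]: col["numpy_type"] for col in meta.get("columns", [])}


def dtype_for_empty_column(name, fallback, recorded):
    """Prefer the recorded dtype; otherwise keep the current fallback."""
    return recorded.get(name, fallback)


recorded = original_dtypes({b"pandas": EXAMPLE_PANDAS_METADATA})
print(dtype_for_empty_column("name", "float64", recorded))
```

Falling back to the existing default keeps the behavior unchanged for files written without pandas metadata.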


> Better handle reading empty parquet files
> -----------------------------------------
>
>                 Key: ARROW-2444
>                 URL: https://issues.apache.org/jira/browse/ARROW-2444
>             Project: Apache Arrow
>          Issue Type: Improvement
>            Reporter: Jim Crist
>            Priority: Major



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
