[
https://issues.apache.org/jira/browse/ARROW-2444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wes McKinney updated ARROW-2444:
--------------------------------
Labels: parquet (was: )
> Better handle reading empty parquet files
> -----------------------------------------
>
> Key: ARROW-2444
> URL: https://issues.apache.org/jira/browse/ARROW-2444
> Project: Apache Arrow
> Issue Type: Improvement
> Reporter: Jim Crist
> Priority: Major
> Labels: parquet
>
> From [https://github.com/dask/dask/pull/3387#issuecomment-380140003]
>
> Currently pyarrow reads empty parts as float64, even if the underlying
> columns have other dtypes. This can cause problems for pandas downstream, as
> certain operations are only valid on certain dtypes, even if the columns are
> empty.
>
> Copying Uwe's comment over:
>
> {quote}This is the expected behaviour as an empty string column in Pandas
> is simply an empty column of type object. Sadly object does not tell us much
> about the type of the column at all. We return numpy.float64 in this case as
> it's the most efficient type to store nulls in Pandas.{quote}
> {quote}This seems unintuitive at best to me. An empty object column in pandas
> is treated differently in many operations than an empty float64 column (str
> accessor is available, excluded from numeric operations, etc.). Having an
> empty file read in as a different dtype than was written could lead to errors
> in processing code downstream. Would arrow be willing to change this
> behavior?{quote}
> We should probably use a method other than `field.type.to_pandas_dtype()` in
> this case. The column saved in Parquet should be saved with `NA` as its type,
> which sadly does not provide enough information.
> We also store the original dtype in the Pandas metadata that is used for the
> actual DataFrame reconstruction later on. If we also picked up that metadata
> as it was written, we should be able to correctly reconstruct the dtype.
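
To illustrate the metadata-based approach suggested above, here is a minimal sketch that recovers the recorded dtypes from the pandas metadata. It assumes the metadata is the JSON blob pyarrow stores under the `b"pandas"` key of the schema metadata, with a `columns` list whose entries carry `name` and `numpy_type`. The helper name `dtypes_from_pandas_metadata` and the sample metadata are hypothetical, and this is not how pyarrow itself implements the reconstruction.

```python
import json


def dtypes_from_pandas_metadata(schema_metadata):
    """Return {column name: numpy dtype string} as recorded by pandas.

    `schema_metadata` is assumed to be a bytes-keyed dict, like the
    `metadata` attribute of a pyarrow schema. Returns {} when no pandas
    metadata is present.
    """
    raw = (schema_metadata or {}).get(b"pandas")
    if raw is None:
        return {}
    meta = json.loads(raw.decode("utf-8"))
    return {
        col["name"]: col["numpy_type"]
        for col in meta.get("columns", [])
        if col.get("name") is not None
    }


# Hypothetical metadata, shaped like what is written alongside a DataFrame.
sample_metadata = {
    b"pandas": json.dumps({
        "columns": [
            {"name": "name", "pandas_type": "unicode", "numpy_type": "object"},
            {"name": "count", "pandas_type": "int64", "numpy_type": "int64"},
        ]
    }).encode("utf-8")
}

empty_dtypes = dtypes_from_pandas_metadata(sample_metadata)
```

With such a mapping, an empty column could be built with its recorded dtype (for example `pandas.Series([], dtype=empty_dtypes["name"])`) rather than defaulting to float64.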