[
https://issues.apache.org/jira/browse/ARROW-2444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Neal Richardson resolved ARROW-2444.
------------------------------------
Assignee: Joris Van den Bossche
Resolution: Fixed
Sounds like this is resolved now; please open a new issue if there's more work
to do.
> [Python][C++] Better handle reading empty parquet files
> -------------------------------------------------------
>
> Key: ARROW-2444
> URL: https://issues.apache.org/jira/browse/ARROW-2444
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Jim Crist
> Assignee: Joris Van den Bossche
> Priority: Major
> Labels: dataset, dataset-parquet-read, parquet
> Fix For: 1.0.0
>
>
> From [https://github.com/dask/dask/pull/3387#issuecomment-380140003]
>
> Currently pyarrow reads empty parts as float64, even if the underlying
> columns have other dtypes. This can cause problems for pandas downstream, as
> certain operations are only valid on certain dtypes, even if the columns are
> empty.
>
> Copying Uwe's comment over:
>
> {quote}This is the expected behaviour as an empty string column in Pandas
> is simply an empty column of type object. Sadly object does not tell us much
> about the type of the column at all. We return numpy.float64 in this case as
> it's the most efficient type to store nulls in Pandas.{quote}
> {quote}This seems unintuitive at best to me. An empty object column in pandas
> is treated differently in many operations than an empty float64 column (str
> accessor is available, excluded from numeric operations, etc.). Having an
> empty file read in as a different dtype than was written could lead to errors
> in processing code downstream. Would arrow be willing to change this
> behavior?{quote}
> We should probably use a method other than `field.type.to_pandas_dtype()` in
> this case. The column should be saved in Parquet with `NA` as its type,
> which sadly does not provide enough information.
> We also store the original dtype in the Pandas metadata that is used for the
> actual DataFrame reconstruction later on. If we also picked up that metadata
> as it was written, we should be able to correctly reconstruct the dtype.
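The metadata-based idea in the description can be sketched roughly as
follows. This is only an illustration, not the fix that was merged: the
helper name `dtypes_from_pandas_metadata` and the example metadata values are
made up here, and the sketch assumes the usual layout in which the writer
stores a JSON document under the `b"pandas"` key of the schema metadata, with
a `numpy_type` entry per column.

```python
import json

# Hypothetical example of the JSON blob stored under the b"pandas" key of the
# schema metadata; the column names and version are invented for this sketch.
EXAMPLE_PANDAS_METADATA = json.dumps({
    "index_columns": [],
    "columns": [
        {"name": "name", "field_name": "name", "pandas_type": "unicode",
         "numpy_type": "object", "metadata": None},
        {"name": "count", "field_name": "count", "pandas_type": "int64",
         "numpy_type": "int64", "metadata": None},
    ],
    "pandas_version": "0.23.0",
}).encode("utf-8")


def dtypes_from_pandas_metadata(schema_metadata):
    """Return {column name: numpy dtype string} as recorded at write time.

    `schema_metadata` is a bytes-to-bytes mapping (or None). An empty dict is
    returned when no pandas metadata is present.
    """
    raw = (schema_metadata or {}).get(b"pandas")
    if raw is None:
        return {}
    meta = json.loads(raw.decode("utf-8"))
    return {col["name"]: col["numpy_type"] for col in meta.get("columns", [])}


# Even if the part holds zero rows, the recorded dtypes are still available.
recorded = dtypes_from_pandas_metadata({b"pandas": EXAMPLE_PANDAS_METADATA})
print(recorded)
```

With pyarrow, the mapping passed in would presumably be
`table.schema.metadata`, and the recovered dtypes could then be applied to the
empty DataFrame instead of falling back to float64.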