[ https://issues.apache.org/jira/browse/ARROW-8378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wes McKinney closed ARROW-8378.
-------------------------------
Resolution: Not A Problem
What [~jorisvandenbossche] wrote is right -- I missed that
{{astype(np.unicode_)}} was coercing {{None}} to {{'None'}}, so this is not
something that we're able to "fix" per se.
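
A minimal sketch of the coercion described above, using plain NumPy rather than pandas so that it does not depend on a particular pandas version. Here {{str}} stands in for {{np.unicode_}}, which was an alias of NumPy's string type in the versions this issue concerns; the variable names are illustrative only.

```python
import numpy as np

# Three missing values held as Python objects, as in an all-None pandas column.
values = np.array([None, None, None], dtype=object)

# Casting to a string dtype stringifies each element, so None becomes 'None'.
as_text = values.astype(str)

print(as_text.tolist())    # ['None', 'None', 'None']
print(as_text.dtype.kind)  # U
```

After the cast the column holds three real four-character strings rather than three missing values, which is why a writer sees it as a string column.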
> [Python] "empty" dtype metadata leads to wrong Parquet column type
> ------------------------------------------------------------------
>
> Key: ARROW-8378
> URL: https://issues.apache.org/jira/browse/ARROW-8378
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.16.0
> Environment: Python: 3.7.6
> Pandas: 0.24.1, 0.25.3, 1.0.3
> Pyarrow: 0.16.0
> OS: OSX 10.15.3
> Reporter: Diego Argueta
> Priority: Major
>
> Run the following code with pandas 0.24.x through 1.0.x and PyArrow 0.16.0
> on Python 3.7:
> {code:python}
> import pandas as pd
> import numpy as np
>
> # Case 1: cast an existing all-None column to a string dtype.
> df_1 = pd.DataFrame({'col': [None, None, None]})
> df_1.col = df_1.col.astype(np.unicode_)
> df_1.to_parquet('right.parq', engine='pyarrow')
>
> # Case 2: request the string dtype when constructing the Series.
> series = pd.Series([None, None, None], dtype=np.unicode_)
> df_2 = pd.DataFrame({'col': series})
> df_2.to_parquet('wrong.parq', engine='pyarrow')
> {code}
> Examine the Parquet column type for each file (I use
> [parquet-tools|https://github.com/wesleypeck/parquet-tools]). {{right.parq}}
> has the expected UTF-8 string type, but {{wrong.parq}} has an {{INT32}} column.
> The following metadata is stored in the Parquet files:
> {{right.parq}}
> {code:json}
> {
> "column_indexes": [],
> "columns": [
> {
> "field_name": "col",
> "metadata": null,
> "name": "col",
> "numpy_type": "object",
> "pandas_type": "unicode"
> }
> ],
> "index_columns": [],
> "pandas_version": "0.24.1"
> }
> {code}
> {{wrong.parq}}
> {code:json}
> {
> "column_indexes": [],
> "columns": [
> {
> "field_name": "col",
> "metadata": null,
> "name": "col",
> "numpy_type": "object",
> "pandas_type": "empty"
> }
> ],
> "index_columns": [],
> "pandas_version": "0.24.1"
> }
> {code}
> The difference between the two is that the {{pandas_type}} for the incorrect
> file is "empty" rather than the expected "unicode". PyArrow misinterprets
> this and defaults to a 32-bit integer column.
> The incorrect datatype will cause Redshift to reject the file when we try to
> read it because the column type in the file doesn't match the column type in
> the database table.
> I originally filed this as a bug in Pandas (see [this
> ticket|https://github.com/pandas-dev/pandas/issues/25326]) but they punted me
> over here because the dtype conversion is handled in PyArrow. I'm not sure
> how you'd handle this here.