[ https://issues.apache.org/jira/browse/ARROW-8378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney closed ARROW-8378.
-------------------------------
    Resolution: Not A Problem

What [~jorisvandenbossche] wrote is right -- I missed that 
{{astype(np.unicode_)}} coerces {{None}} to the string {{'None'}}, so this is 
not something we're able to "fix" per se. 
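For reference, the coercion is easy to reproduce in pandas itself. A minimal sketch, using {{astype(str)}} rather than {{astype(np.unicode_)}} (the two behave the same way for this case, and {{np.unicode_}} was removed in NumPy 2.0):

```python
import pandas as pd

# An all-None object column, as in the report.
s = pd.Series([None, None, None])

# astype(str) applies str() element-wise, so each None becomes the
# literal string 'None' rather than remaining a missing value.
coerced = s.astype(str)
print(coerced.tolist())  # ['None', 'None', 'None']
```

So by the time PyArrow sees the "right.parq" DataFrame, its column genuinely contains strings, not nulls.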

> [Python] "empty" dtype metadata leads to wrong Parquet column type
> ------------------------------------------------------------------
>
>                 Key: ARROW-8378
>                 URL: https://issues.apache.org/jira/browse/ARROW-8378
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.16.0
>         Environment: Python: 3.7.6
> Pandas: 0.24.1, 0.25.3, 1.0.3
> Pyarrow: 0.16.0
> OS: OSX 10.15.3
>            Reporter: Diego Argueta
>            Priority: Major
>
> Run the following code with Pandas 0.24.x-1.0.x and PyArrow 0.16.0 on Python 
> 3.7:
> {code:python}
> import pandas as pd
> import numpy as np
>
> # Casting the all-None object column to np.unicode_ stringifies each value,
> # so this column ends up holding actual strings.
> df_1 = pd.DataFrame({'col': [None, None, None]})
> df_1.col = df_1.col.astype(np.unicode_)
> df_1.to_parquet('right.parq', engine='pyarrow')
>
> # Constructing the Series with dtype=np.unicode_ keeps the values as None,
> # and pandas records the column's pandas_type as "empty".
> series = pd.Series([None, None, None], dtype=np.unicode_)
> df_2 = pd.DataFrame({'col': series})
> df_2.to_parquet('wrong.parq', engine='pyarrow')
> {code}
> Examine the Parquet column type for each file (I use 
> [parquet-tools|https://github.com/wesleypeck/parquet-tools]). {{right.parq}} 
> has the expected UTF-8 string type; {{wrong.parq}} has an {{INT32}} type.
> The following metadata is stored in the Parquet files:
> {{right.parq}}
> {code:json}
> {
>   "column_indexes": [],
>   "columns": [
>     {
>       "field_name": "col",
>       "metadata": null,
>       "name": "col",
>       "numpy_type": "object",
>       "pandas_type": "unicode"
>     }
>   ],
>   "index_columns": [],
>   "pandas_version": "0.24.1"
> }
> {code}
> {{wrong.parq}}
> {code:json}
> {
>   "column_indexes": [],
>   "columns": [
>     {
>       "field_name": "col",
>       "metadata": null,
>       "name": "col",
>       "numpy_type": "object",
>       "pandas_type": "empty"
>     }
>   ],
>   "index_columns": [],
>   "pandas_version": "0.24.1"
> }
> {code}
> The difference between the two is that the {{pandas_type}} for the incorrect 
> file is "empty" rather than the expected "unicode". PyArrow misinterprets 
> this and defaults to a 32-bit integer column.
> The incorrect datatype will cause Redshift to reject the file when we try to 
> read it because the column type in the file doesn't match the column type in 
> the database table.
> I originally filed this as a bug in Pandas (see [this 
> ticket|https://github.com/pandas-dev/pandas/issues/25326]) but they punted me 
> over here because the dtype conversion is handled in PyArrow. I'm not sure 
> how you'd handle this here.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
