[jira] [Created] (ARROW-8378) [Python] "empty" dtype metadata leads to wrong Parquet column type

Diego Argueta (Jira) Wed, 08 Apr 2020 16:25:14 -0700

Diego Argueta created ARROW-8378:
------------------------------------

             Summary: [Python] "empty" dtype metadata leads to wrong Parquet column type
                 Key: ARROW-8378
                 URL: https://issues.apache.org/jira/browse/ARROW-8378
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.16.0
         Environment: Python: 3.7.6
Pandas: 0.24.1, 0.25.3, 1.0.3
Pyarrow: 0.16.0
OS: OSX 10.15.3
            Reporter: Diego Argueta



Run the following code with Pandas 0.24.x through 1.0.x and PyArrow 0.16.0 on 
Python 3.7:
{code:python}
import pandas as pd
import numpy as np

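# Case 1: build the DataFrame first, then cast the all-None column to a string dtype.
df_1 = pd.DataFrame({'col': [None, None, None]})
df_1.col = df_1.col.astype(np.unicode_)
df_1.to_parquet('right.parq', engine='pyarrow')

# Case 2: build the all-None Series with the string dtype up front.
series = pd.Series([None, None, None], dtype=np.unicode_)
df_2 = pd.DataFrame({'col': series})
df_2.to_parquet('wrong.parq', engine='pyarrow')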
{code}
Examine the Parquet column type for each file (I use 
[parquet-tools|https://github.com/wesleypeck/parquet-tools]). {{right.parq}} 
has the expected UTF-8 string type, but {{wrong.parq}} has an {{INT32}} column.

The following metadata is stored in the Parquet files:

{{right.parq}}
{code:json}
{
  "column_indexes": [],
  "columns": [
    {
      "field_name": "col",
      "metadata": null,
      "name": "col",
      "numpy_type": "object",
      "pandas_type": "unicode"
    }
  ],
  "index_columns": [],
  "pandas_version": "0.24.1"
}
{code}
{{wrong.parq}}
{code:json}
{
  "column_indexes": [],
  "columns": [
    {
      "field_name": "col",
      "metadata": null,
      "name": "col",
      "numpy_type": "object",
      "pandas_type": "empty"
    }
  ],
  "index_columns": [],
  "pandas_version": "0.24.1"
}
{code}
The difference between the two is that the {{pandas_type}} for the incorrect 
file is "empty" rather than the expected "unicode". PyArrow misinterprets this 
and defaults to a 32-bit integer column.
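
To spot affected files ahead of time, something like the following could scan the pandas metadata for columns tagged "empty". It's a minimal sketch: {{empty_typed_columns}} is a hypothetical helper, and the two sample blobs are trimmed copies of the metadata shown above. With PyArrow, the real blob is normally available as {{pq.read_schema(path).metadata[b'pandas']}}.

```python
import json


def empty_typed_columns(pandas_metadata):
    """Return the names of columns whose pandas_type is "empty".

    Accepts the metadata either as a JSON string/bytes blob or as a dict.
    """
    if isinstance(pandas_metadata, (str, bytes)):
        pandas_metadata = json.loads(pandas_metadata)
    return [
        col["name"]
        for col in pandas_metadata.get("columns", [])
        if col.get("pandas_type") == "empty"
    ]


# Trimmed copies of the metadata from the two files in this report.
RIGHT_META = '{"columns": [{"name": "col", "numpy_type": "object", "pandas_type": "unicode"}]}'
WRONG_META = '{"columns": [{"name": "col", "numpy_type": "object", "pandas_type": "empty"}]}'

print(empty_typed_columns(RIGHT_META))  # no columns flagged
print(empty_typed_columns(WRONG_META))  # flags "col"
```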

The incorrect datatype will cause Redshift to reject the file when we try to 
read it because the column type in the file doesn't match the column type in 
the database table.

I originally filed this as a bug in Pandas (see [this 
ticket|https://github.com/pandas-dev/pandas/issues/25326]) but they punted me 
over here because the dtype conversion is handled in PyArrow. I'm not sure how 
you'd handle this here.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
