[ https://issues.apache.org/jira/browse/ARROW-8378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17079297#comment-17079297 ]

Joris Van den Bossche commented on ARROW-8378:
----------------------------------------------

[~yiannisliodakis] I think the "bug" lies in the expectation of what your 
pandas code is doing. The two dataframes you create that way are actually 
not equal:
{code:python}
In [30]: df_1.col.values 
Out[30]: array(['None', 'None', 'None'], dtype=object)

In [31]: df_2.col.values   
Out[31]: array([None, None, None], dtype=object)
{code}
Here you can see that the first is actually a column of strings 
({{"None"}}), while the second is an array containing only "nulls" ({{None}}).

So from pyarrow's point of view, it is doing the expected thing: in the first 
case it creates a string column (because it gets passed strings); in the second 
case there are only nulls, so pyarrow decides to use a {{null}} type. 
As [~wesm] said, if you want to ensure the correct types also for corner cases 
such as all-missing values, you need to pass an explicit expected schema (to do 
this, you will first need to create a pyarrow table and write that to parquet, 
as I think it is currently not possible through the pandas API).
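A minimal sketch of what that could look like (the schema and the output file 
name here are just placeholders for illustration):
{code:python}
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'col': pd.Series([None, None, None], dtype=np.unicode_)})

# Declare the expected type up front, so an all-null column still comes
# out as a string column instead of the inferred null type.
schema = pa.schema([pa.field('col', pa.string())])
table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
pq.write_table(table, 'explicit_schema.parq')
{code}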

The ultimate reason for this is a limitation of pandas' dtype system. 
Although you specify {{dtype=np.unicode_}}, pandas does not support that data 
type and will generally store such values with "object" dtype. But an object 
dtype column can contain anything, and thus the conversion of an object dtype 
column to pyarrow depends on the content of the column.
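To illustrate the content-dependent inference directly (a minimal sketch, 
going through {{pa.array}} rather than parquet):
{code:python}
import numpy as np
import pyarrow as pa

# Same "object" dtype on the numpy/pandas side, different inferred Arrow types:
pa.array(np.array(['None', 'None'], dtype=object)).type  # -> string
pa.array(np.array([None, None], dtype=object)).type      # -> null
{code}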

Note that in the latest version of pandas there is a dedicated "string" dtype, 
which will always be converted to string, even if the column contains only 
missing values (see 
[https://pandas.pydata.org/docs/whatsnew/v1.0.0.html#dedicated-string-data-type]).
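A sketch of the original repro using that dtype (assuming pandas >= 1.0; the 
variable and file names are just placeholders):
{code:python}
import pandas as pd

# With the dedicated "string" dtype the column type is unambiguous, so
# pyarrow can write a string column even when all values are missing.
df_3 = pd.DataFrame({'col': pd.Series([None, None, None], dtype="string")})
df_3.to_parquet('also_right.parq', engine='pyarrow')
{code}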

> [Python] "empty" dtype metadata leads to wrong Parquet column type
> ------------------------------------------------------------------
>
>                 Key: ARROW-8378
>                 URL: https://issues.apache.org/jira/browse/ARROW-8378
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.16.0
>         Environment: Python: 3.7.6
> Pandas: 0.24.1, 0.25.3, 1.0.3
> Pyarrow: 0.16.0
> OS: OSX 10.15.3
>            Reporter: Diego Argueta
>            Priority: Major
>
> Run the following code with Pandas 0.24.x-1.0.x, and PyArrow 0.16.0 on Python 
> 3.7:
> {code:python}
> import pandas as pd
> import numpy as np
> df_1 = pd.DataFrame({'col': [None, None, None]})
> df_1.col = df_1.col.astype(np.unicode_)
> df_1.to_parquet('right.parq', engine='pyarrow')
> series = pd.Series([None, None, None], dtype=np.unicode_)
> df_2 = pd.DataFrame({'col': series})
> df_2.to_parquet('wrong.parq', engine='pyarrow')
> {code}
> Examine the Parquet column type for each file (I use 
> [parquet-tools|https://github.com/wesleypeck/parquet-tools]). {{right.parq}} 
> has the expected UTF-8 string type. {{wrong.parq}} has an {{INT32}}.
> The following metadata is stored in the Parquet files:
> {{right.parq}}
> {code:json}
> {
>   "column_indexes": [],
>   "columns": [
>     {
>       "field_name": "col",
>       "metadata": null,
>       "name": "col",
>       "numpy_type": "object",
>       "pandas_type": "unicode"
>     }
>   ],
>   "index_columns": [],
>   "pandas_version": "0.24.1"
> }
> {code}
> {{wrong.parq}}
> {code:json}
> {
>   "column_indexes": [],
>   "columns": [
>     {
>       "field_name": "col",
>       "metadata": null,
>       "name": "col",
>       "numpy_type": "object",
>       "pandas_type": "empty"
>     }
>   ],
>   "index_columns": [],
>   "pandas_version": "0.24.1"
> }
> {code}
> The difference between the two is that the {{pandas_type}} for the incorrect 
> file is "empty" rather than the expected "unicode". PyArrow misinterprets 
> this and defaults to a 32-bit integer column.
> The incorrect datatype will cause Redshift to reject the file when we try to 
> read it because the column type in the file doesn't match the column type in 
> the database table.
> I originally filed this as a bug in Pandas (see [this 
> ticket|https://github.com/pandas-dev/pandas/issues/25326]) but they punted me 
> over here because the dtype conversion is handled in PyArrow. I'm not sure 
> how you'd handle this here.


