[
https://issues.apache.org/jira/browse/ARROW-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487900#comment-17487900
]
Alenka Frim commented on ARROW-14488:
-------------------------------------
Hi [~zijie0], thank you for reporting! And sorry for the late reply.
I think this may be a bug on the Arrow side: when constructing metadata in
_dataframe_to_types_ (_pandas_compat.py_), the conversion from an empty pandas
Series to pa.array yields the wrong type for a string (object) dtype. Here is an example:
{code:python}
>>> import pandas as pd
>>> import pyarrow as pa
>>> df = pd.DataFrame([['a', 1, 1.0]], columns=['a', 'b', 'c'])
>>> df
   a  b    c
0  a  1  1.0
>>> df["a"]
0    a
Name: a, dtype: object
# Non-empty dataframe
>>> pa.array(df["a"], from_pandas=True) # Works for non-empty dataframe
<pyarrow.lib.StringArray object at 0x12462ed00>
[
"a"
]
>>> pa.array(df["a"], from_pandas=True).type
DataType(string)
# Empty dataframe
>>> # Becomes a NullArray with no dtype in case of string/object
>>> pa.array(df["a"].head(0), from_pandas=True)
<pyarrow.lib.NullArray object at 0x12462eac0>
0 nulls
>>> pa.array(df["a"].head(0), from_pandas=True).type
DataType(null)
{code}
This does not happen for integer or double columns:
{code:python}
>>> df["b"]
0    1
Name: b, dtype: int64
>>> pa.array(df["b"], from_pandas=True)
<pyarrow.lib.Int64Array object at 0x12462eac0>
[
1
]
>>> pa.array(df["b"].head(0), from_pandas=True)
<pyarrow.lib.Int64Array object at 0x12462ea60>
[]
>>> pa.array(df["b"].head(0), from_pandas=True).type
DataType(int64)
{code}
[~jorisvandenbossche] what do you think?
> [Python] Incorrect inferred schema from pandas dataframe with length 0.
> -----------------------------------------------------------------------
>
> Key: ARROW-14488
> URL: https://issues.apache.org/jira/browse/ARROW-14488
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 5.0.0
> Environment: OS: Windows 10, CentOS 7
> Reporter: Yuan Zhou
> Priority: Major
>
> We use pandas (with the pyarrow engine) to write out Parquet files, and those
> outputs are consumed by other applications, such as Java apps using
> org.apache.parquet.hadoop.ParquetFileReader. We found that some empty
> dataframes get an incorrect schema for string columns in those
> applications. After some investigation, we narrowed the issue down to
> schema inference by pyarrow:
> {code:python}
> In [1]: import pandas as pd
> In [2]: df = pd.DataFrame([['a', 1, 1.0]], columns=['a', 'b', 'c'])
> In [3]: import pyarrow as pa
> In [4]: pa.Schema.from_pandas(df)
> Out[4]:
> a: string
> b: int64
> c: double
> -- schema metadata --
> pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 562
> In [5]: pa.Schema.from_pandas(df.head(0))
> Out[5]:
> a: null
> b: int64
> c: double
> -- schema metadata --
> pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 560
> In [6]: pa.__version__
> Out[6]: '5.0.0'
> {code}
> As you can see, column 'a', which should be of string type, is inferred as
> null type and is converted to int32 when writing to Parquet files.
> Is this expected behavior? Or is there a workaround for this issue?
> Could anyone take a look, please? Thanks!