[ https://issues.apache.org/jira/browse/ARROW-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488402#comment-17488402 ]
Alenka Frim commented on ARROW-14488:
-------------------------------------
Thank you Joris!
An example would be:
{code:python}
>>> import pandas as pd
>>> df = pd.DataFrame([['a', 1, 1.0]], columns=['a', 'b', 'c'])
>>> import pyarrow as pa
>>>
>>> schema = pa.schema([
... pa.field('a', pa.string()),
... pa.field('b', pa.int64()),
... pa.field('c', pa.float64())])
>>>
>>> pa.Table.from_pandas(df, schema=schema)
pyarrow.Table
a: string
b: int64
c: double
----
a: [["a"]]
b: [[1]]
c: [[1]]
>>> pa.Table.from_pandas(df.head(0), schema=schema)
pyarrow.Table
a: string
b: int64
c: double
----
a: [[]]
b: [[]]
c: [[]]
{code}
> [Python] Incorrect inferred schema from pandas dataframe with length 0.
> -----------------------------------------------------------------------
>
> Key: ARROW-14488
> URL: https://issues.apache.org/jira/browse/ARROW-14488
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 5.0.0
> Environment: OS: Windows 10, CentOS 7
> Reporter: Yuan Zhou
> Priority: Major
>
> We use pandas (with the pyarrow engine) to write Parquet files, and those
> outputs are consumed by other applications, such as Java apps using
> org.apache.parquet.hadoop.ParquetFileReader. We found that some empty
> dataframes get an incorrect schema for string columns in those
> applications. After some investigation, we narrowed the issue down to the
> schema inference by pyarrow:
> {code:python}
> In [1]: import pandas as pd
> In [2]: df = pd.DataFrame([['a', 1, 1.0]], columns=['a', 'b', 'c'])
> In [3]: import pyarrow as pa
> In [4]: pa.Schema.from_pandas(df)
> Out[4]:
> a: string
> b: int64
> c: double
> -- schema metadata --
> pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 562
> In [5]: pa.Schema.from_pandas(df.head(0))
> Out[5]:
> a: null
> b: int64
> c: double
> -- schema metadata --
> pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 560
> In [6]: pa.__version__
> Out[6]: '5.0.0'
> {code}
> As you can see, column 'a', which should be string type, is inferred as
> null type and is converted to int32 when written to Parquet files.
> Is this expected behavior? If so, is there a workaround for this issue?
> Could anyone take a look, please? Thanks!