[ https://issues.apache.org/jira/browse/ARROW-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488237#comment-17488237 ]

Joris Van den Bossche commented on ARROW-14488:
-----------------------------------------------

bq. the conversion from empty Pandas series to pa.array is wrong in the case of 
a string dtype.

The main problem is that the example code is not using a "string dtype". By 
default, pandas uses the generic "object" dtype to store strings. But this data 
type means the column can hold _any_ Python object, so it is not guaranteed to 
contain strings (e.g. it could also hold decimals, bytes, etc., for Python 
types that pyarrow also infers).

As long as the array is not empty, the conversion to a pyarrow array will try 
to infer the appropriate type from the values in the input array (e.g. for an 
object-dtype array containing strings, it will indeed convert it to a 
{{pa.string()}} type). But if the array is empty, there are no values to infer 
the type from, and that is why pyarrow defaults to the generic "null" data 
type for such an array (or column in a DataFrame).
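A minimal sketch of this inference behavior (assuming pandas and pyarrow are installed; the variable names are illustrative):

{code:python}
import pandas as pd
import pyarrow as pa

# A non-empty object-dtype series: pyarrow inspects the values
# and infers a string type.
s = pd.Series(['a', 'b'])  # dtype is the generic "object", not a string dtype
print(pa.Array.from_pandas(s).type)          # string

# The same series with all rows dropped: there are no values to
# inspect, so pyarrow falls back to the generic null type.
print(pa.Array.from_pandas(s.head(0)).type)  # null
{code}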

If you know that a certain column contains strings, and you want the 
pandas->pyarrow conversion to work robustly (regardless of whether the 
dataframes/arrays are empty), the {{from_pandas}} method has a {{schema}} 
argument with which you can specify the schema to use (so pyarrow will not try 
to infer the types from the values in the array). In this case, though, you 
will have to construct that schema manually.
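A sketch of that workaround, assuming a manually constructed schema whose column names match the dataframe (the field names and types here are illustrative):

{code:python}
import pandas as pd
import pyarrow as pa

# An empty dataframe: column 'a' has object dtype, so its type
# cannot be inferred from values.
df = pd.DataFrame({'a': pd.Series([], dtype=object),
                   'b': pd.Series([], dtype='int64')})

# Declaring 'a' as string up front skips inference entirely.
schema = pa.schema([('a', pa.string()), ('b', pa.int64())])
table = pa.Table.from_pandas(df, schema=schema)
print(table.schema.field('a').type)  # string, even though df is empty
{code}

The resulting table (and any parquet file written from it) then carries the string type regardless of how many rows the dataframe has.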



> [Python] Incorrect inferred schema from pandas dataframe with length 0.
> -----------------------------------------------------------------------
>
>                 Key: ARROW-14488
>                 URL: https://issues.apache.org/jira/browse/ARROW-14488
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 5.0.0
>         Environment: OS: Windows 10, CentOS 7
>            Reporter: Yuan Zhou
>            Priority: Major
>
> We use pandas(with pyarrow engine) to write out parquet files and those 
> outputs will be consumed by other applications such as Java apps using 
> org.apache.parquet.hadoop.ParquetFileReader. We found that some empty 
> dataframes would get incorrect schema for string columns in other 
> applications. After some investigation, we narrow down the issue to the 
> schema inference by pyarrow:
> {code:java}
> In [1]: import pandas as pd
> In [2]: df = pd.DataFrame([['a', 1, 1.0]], columns=['a', 'b', 'c'])
> In [3]: import pyarrow as pa
> In [4]: pa.Schema.from_pandas(df)
>  Out[4]:
>  a: string
>  b: int64
>  c: double
>  -- schema metadata --
>  pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 
> 562
> In [5]: pa.Schema.from_pandas(df.head(0))
>  Out[5]:
>  a: null
>  b: int64
>  c: double
>  -- schema metadata --
>  pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 
> 560
> In [6]: pa.__version__
>  Out[6]: '5.0.0'
> {code}
>  As you can see, column 'a', which should be string type, is inferred as 
> null type and is converted to int32 when writing to parquet files.
> Is this expected behavior? Or is there any workaround for this issue? 
> Could anyone take a look, please? Thanks!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)