[ 
https://issues.apache.org/jira/browse/ARROW-8251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-8251:
-----------------------------------------
    Fix Version/s: 1.0.0

> [Python] pandas.ExtensionDtype does not survive round trip with 
> write_to_dataset
> --------------------------------------------------------------------------------
>
>                 Key: ARROW-8251
>                 URL: https://issues.apache.org/jira/browse/ARROW-8251
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.16.0
>         Environment: pandas 1.0.1
> parquet 0.16
>            Reporter: Ged Steponavicius
>            Assignee: Joris Van den Bossche
>            Priority: Major
>             Fix For: 1.0.0
>
>
> write_to_dataset, given pandas columns that use a pandas.ExtensionDtype 
> (nullable int or string), produces a Parquet dataset which, when read back 
> in, has different dtypes than the original df:
> {code:java}
> import numpy as np 
> import pandas as pd 
> import pyarrow as pa 
> import pyarrow.parquet as pq 
> parquet_dataset = 'partquet_dataset/' 
> parquet_file = 'test.parquet' 
> df = pd.DataFrame([{'str_col':'abc','int_col':1,'part':1}, 
>                    {'str_col':np.nan,'int_col':np.nan,'part':1}]) 
> df['str_col'] = df['str_col'].astype(pd.StringDtype()) 
> df['int_col'] = df['int_col'].astype(pd.Int64Dtype()) 
> table = pa.Table.from_pandas(df) 
> pq.write_to_dataset(table, root_path=parquet_dataset, partition_cols=['part']) 
> pq.write_table(table, where=parquet_file) {code}
> write_table handles the schema correctly; the pandas.ExtensionDtype columns 
> survive the round trip:
> {code:java}
> pq.read_table(parquet_file).to_pandas().dtypes 
> str_col string 
> int_col Int64 
> part int64 {code}
> However, with write_to_dataset the columns revert to object/float64:
> {code:java}
> pq.read_table(parquet_dataset).to_pandas().dtypes 
> str_col object 
> int_col float64 
> part category {code}
> I have also tried writing common metadata at the top-level directory of the 
> partitioned dataset and then passing that metadata to read_table, but the 
> results are the same as without metadata:
> {code:java}
> pq.write_metadata(table.schema, parquet_dataset+'_common_metadata', 
>                   version='2.0') 
> meta = pq.read_metadata(parquet_dataset+'_common_metadata') 
> pq.read_table(parquet_dataset,metadata=meta).to_pandas().dtypes {code}
> This also affects pandas to_parquet when partition_cols is specified:
> {code:java}
> df.to_parquet(path = parquet_dataset, partition_cols=['part']) 
> pd.read_parquet(parquet_dataset).dtypes 
> str_col object 
> int_col float64 
> part category {code}
>  
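
On affected versions, one possible stopgap is to cast the columns back to their extension dtypes after reading. The following is only a minimal sketch under stated assumptions, not the fix tracked by this issue: the helper restore_extension_dtypes and the mapping EXPECTED_DTYPES are hypothetical names, and the frame is built in memory as a stand-in for the result of pq.read_table(parquet_dataset).to_pandas() shown above.

```python
import numpy as np
import pandas as pd

# Hypothetical mapping: column name -> extension dtype the caller expects.
EXPECTED_DTYPES = {"str_col": pd.StringDtype(), "int_col": pd.Int64Dtype()}


def restore_extension_dtypes(df, dtypes=EXPECTED_DTYPES):
    """Return a copy of df with the listed columns cast to extension dtypes."""
    # Only cast columns that are actually present in the frame.
    present = {col: dtype for col, dtype in dtypes.items() if col in df.columns}
    return df.astype(present)


# Stand-in for the degraded frame read back from the partitioned dataset
# (object / float64 / category, as in the report).
degraded = pd.DataFrame({
    "str_col": pd.Series(["abc", np.nan], dtype=object),
    "int_col": pd.Series([1.0, np.nan], dtype="float64"),
    "part": pd.Categorical([1, 1]),
})

restored = restore_extension_dtypes(degraded)
print(restored.dtypes)
```

This requires the caller to know the intended dtypes up front, so it does not replace preserving the pandas metadata in the dataset itself. The partition column is left as category because the mapping does not list it.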



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
