[ https://issues.apache.org/jira/browse/ARROW-6114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche updated ARROW-6114:
-----------------------------------------
    Summary: [Python] Datatypes are not preserved when a pandas dataframe 
partitioned and saved as parquet file using pyarrow  (was: Datatypes are not 
preserved when a pandas dataframe partitioned and saved as parquet file using 
pyarrow)

> [Python] Datatypes are not preserved when a pandas dataframe partitioned and 
> saved as parquet file using pyarrow
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-6114
>                 URL: https://issues.apache.org/jira/browse/ARROW-6114
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.14.1
>         Environment: Python 3.7.3
> pyarrow 0.14.1
>            Reporter: Naga
>            Priority: Major
>              Labels: dataset, parquet
>
> h3. Datatypes are not preserved when a pandas DataFrame is *partitioned* and 
> saved as a Parquet dataset using pyarrow, although they are preserved when the 
> DataFrame is not partitioned.
> *Case 1: Saving a partitioned dataset - Data Types are NOT preserved*
> {code:python}
> # Save a pandas DataFrame locally as a partitioned Parquet dataset using pyarrow
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
> path = 'test'
> partition_cols = ['age']
> print('Datatypes before saving the dataset')
> print(df.dtypes)
> table = pa.Table.from_pandas(df)
> pq.write_to_dataset(table, path, partition_cols=partition_cols,
>                     preserve_index=False)
> # Load the partitioned Parquet dataset back from local disk
> df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
> print('\nDatatypes after loading the dataset')
> print(df.dtypes)
> {code}
> *Output:*
> {code:java}
> Datatypes before saving the dataset
> age int64
> name object
> dtype: object
> Datatypes after loading the dataset
> name object
> age category
> dtype: object
> {code}
> h5. {color:#d04437}From the above output, we can see that the data type of 
> age is int64 in the original pandas DataFrame, but it changed to category 
> after the dataset was saved locally and loaded back.{color}
> *Case 2: Non-partitioned dataset - Data types are preserved*
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> print('Saving a Pandas Dataframe to Local as a parquet file without '
>       'partitioning using pyarrow')
> df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
> path = 'test_without_partition'
> print('Datatypes before saving the dataset')
> print(df.dtypes)
> table = pa.Table.from_pandas(df)
> pq.write_to_dataset(table, path, preserve_index=False)
> # Load the non-partitioned Parquet dataset back from local disk
> df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
> print('\nDatatypes after loading the dataset')
> print(df.dtypes)
> {code}
> *Output:*
> {code:java}
> Saving a Pandas Dataframe to Local as a parquet file without partitioning 
> using pyarrow
> Datatypes before saving the dataset
> age int64
> name object
> dtype: object
> Datatypes after loading the dataset
> age int64
> name object
> dtype: object
> {code}
> *Versions*
>  * Python 3.7.3
>  * pyarrow 0.14.1
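
As a stopgap for the Case 1 behaviour, the partition column can be cast back to its original dtype after loading. The sketch below is illustrative only and is not part of the original report. `loaded` is a hand-built stand-in for the DataFrame returned by `to_pandas()`, on the assumption that the partition column comes back as a categorical with integer categories. `restore_dtypes` is a hypothetical helper name, not a pyarrow or pandas API.

```python
import pandas as pd

# Stand-in for the frame read back from the partitioned dataset (assumption):
# the partition column 'age' arrives as a pandas categorical, placed last.
loaded = pd.DataFrame({
    "name": ["agan", "bbobby", "test"],
    "age": pd.Categorical([77, 32, 234]),
})


def restore_dtypes(df, dtypes):
    """Hypothetical helper: return a copy of df with the given columns recast."""
    return df.astype(dtypes)


print("Datatypes as loaded")
print(loaded.dtypes)

# Cast the partition column back to the dtype it had before saving.
restored = restore_dtypes(loaded, {"age": "int64"})

print("\nDatatypes after casting back")
print(restored.dtypes)
```

The cast has to be repeated for every partition column, and the caller must already know the original dtypes, so this only works around the symptom described above.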



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
