[
https://issues.apache.org/jira/browse/ARROW-6114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joris Van den Bossche updated ARROW-6114:
-----------------------------------------
Summary: [Python] Datatypes are not preserved when a pandas dataframe
partitioned and saved as parquet file using pyarrow (was: Datatypes are not
preserved when a pandas dataframe partitioned and saved as parquet file using
pyarrow)
> [Python] Datatypes are not preserved when a pandas dataframe partitioned and
> saved as parquet file using pyarrow
> ----------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-6114
> URL: https://issues.apache.org/jira/browse/ARROW-6114
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.14.1
> Environment: Python 3.7.3
> pyarrow 0.14.1
> Reporter: Naga
> Priority: Major
> Labels: dataset, parquet
>
> h3. Datatypes are not preserved when a pandas data frame is *partitioned* and
> saved as a parquet file using pyarrow, but they are preserved when the data
> frame is not partitioned.
> *Case 1: Saving a partitioned dataset - Data Types are NOT preserved*
> {code:java}
> # Save a pandas DataFrame locally as a partitioned parquet dataset using pyarrow
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
> path = 'test'
> partition_cols=['age']
> print('Datatypes before saving the dataset')
> print(df.dtypes)
> table = pa.Table.from_pandas(df)
> pq.write_to_dataset(table, path, partition_cols=partition_cols,
> preserve_index=False)
> # Load the partitioned parquet dataset from local disk
> df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
> print('\nDatatypes after loading the dataset')
> print(df.dtypes)
> {code}
> *Output:*
> {code:java}
> Datatypes before saving the dataset
> age      int64
> name    object
> dtype: object
>
> Datatypes after loading the dataset
> name      object
> age     category
> dtype: object
> {code}
> h5. {color:#d04437}The output above shows that the data type of age is int64
> in the original pandas data frame, but it changes to category after the data
> frame is saved locally and loaded back.{color}
> *Case 2: Non-partitioned dataset - Data types are preserved*
> {code:java}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> print('Saving a Pandas Dataframe to Local as a parquet file without partitioning using pyarrow')
> df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
> path = 'test_without_partition'
> print('Datatypes before saving the dataset')
> print(df.dtypes)
> table = pa.Table.from_pandas(df)
> pq.write_to_dataset(table, path, preserve_index=False)
> # Load the non-partitioned parquet dataset from local disk
> df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
> print('\nDatatypes after loading the dataset')
> print(df.dtypes)
> {code}
> *Output:*
> {code:java}
> Saving a Pandas Dataframe to Local as a parquet file without partitioning using pyarrow
> Datatypes before saving the dataset
> age      int64
> name    object
> dtype: object
>
> Datatypes after loading the dataset
> age      int64
> name    object
> dtype: object
> {code}
> *Versions*
> * Python 3.7.3
> * pyarrow 0.14.1
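
A likely reason for the difference between the two cases: with partition_cols, write_to_dataset stores the partition values in directory names such as age=77 instead of inside the parquet files, so the column's original type is not recorded alongside the data. The sketch below uses only the standard library to illustrate this. It is not pyarrow's implementation; the helper names and the file name part-0.parquet are hypothetical.

```python
from pathlib import PurePosixPath


def parse_partition_values(file_path):
    """Collect key=value pairs from the directory components of a dataset path."""
    values = {}
    for part in PurePosixPath(file_path).parent.parts:
        key, sep, value = part.partition("=")
        if sep:  # keep only components shaped like key=value
            values[key] = value
    return values


def restore_types(values, dtypes):
    """Cast string values back using a caller-supplied column -> type mapping."""
    return {key: dtypes.get(key, str)(value) for key, value in values.items()}


# Hypothetical file name; only the directory name carries the partition value.
raw = parse_partition_values("test/age=77/part-0.parquet")
restored = restore_types(raw, {"age": int})
print(raw)       # {'age': '77'}
print(restored)  # {'age': 77}
```

On the pandas side, a possible workaround for Case 1 is to cast the column back after loading, for example df['age'] = df['age'].astype('int64'), assuming every partition value can be converted to an integer.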
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)