[ https://issues.apache.org/jira/browse/ARROW-6114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Naga updated ARROW-6114:
------------------------
Description:

h3. Datatypes are not preserved when a pandas data frame is *partitioned* and saved as a parquet file using pyarrow, but they are preserved when the data frame is not partitioned.

*Case 1: Saving a partitioned dataset - Data Types are NOT preserved*

{code:python}
# Save a pandas DataFrame locally as a partitioned parquet dataset using pyarrow
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
path = 'test'
partition_cols = ['age']
print('Datatypes before saving the dataset')
print(df.dtypes)
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, path, partition_cols=partition_cols, preserve_index=False)

# Load the partitioned parquet dataset from local storage
df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
print('\nDatatypes after loading the dataset')
print(df.dtypes)
{code}

*Output:*
{code}
Datatypes before saving the dataset
age      int64
name    object
dtype: object

Datatypes after loading the dataset
name      object
age     category
dtype: object
{code}

h4. The output above shows that the data type of age is int64 in the original pandas data frame, but it changed to category after the dataset was saved locally and loaded back.
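The type change in Case 1 is consistent with how directory-partitioned datasets are commonly laid out: the partition column is encoded in directory names such as age=77 rather than stored in the data files, so its value has to be rebuilt from the path on load. The sketch below illustrates only that idea, using the standard library. It is not pyarrow's reader code; the file paths and the helper parse_partition_keys are hypothetical.

```python
from pathlib import PurePosixPath


def parse_partition_keys(file_path):
    """Return (column, value) pairs found in 'column=value' directory names."""
    keys = []
    for part in PurePosixPath(file_path).parent.parts:
        if "=" in part:
            column, value = part.split("=", 1)
            keys.append((column, value))
    return keys


# Hypothetical file paths, shaped like a dataset partitioned on 'age'.
paths = [
    "test/age=77/part-0.parquet",
    "test/age=32/part-0.parquet",
    "test/age=234/part-0.parquet",
]

# Values recovered from directory names are plain strings ...
raw_ages = [dict(parse_partition_keys(p))["age"] for p in paths]

# ... so the original integer type has to be restored explicitly.
restored_ages = [int(v) for v in raw_ages]

print(raw_ages)
print(restored_ages)
```

In the pandas case reported here, an explicit cast after loading, for example df['age'].astype('int64'), may serve as a workaround, assuming every partition value is a valid integer.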
*Case 2: Non-partitioned dataset - Data types are preserved*

{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

print('Saving a Pandas Dataframe to Local as a parquet file without partitioning using pyarrow')
df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
path = 'test_without_partition'
print('Datatypes before saving the dataset')
print(df.dtypes)
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, path, preserve_index=False)

# Load the non-partitioned parquet file from local storage
df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
print('\nDatatypes after loading the dataset')
print(df.dtypes)
{code}

*Output:*
{code}
Saving a Pandas Dataframe to Local as a parquet file without partitioning using pyarrow
Datatypes before saving the dataset
age      int64
name    object
dtype: object

Datatypes after loading the dataset
age      int64
name    object
dtype: object
{code}

*Versions*
* Python 3.7.3
* pyarrow 0.14.1

> Datatypes are not preserved when a pandas dataframe partitioned and saved as parquet file using pyarrow
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-6114
>                 URL: https://issues.apache.org/jira/browse/ARROW-6114
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.14.1
>         Environment: Python 3.7.3
>                      pyarrow 0.14.1
>            Reporter: Naga
>            Priority: Major
>
> h3. Datatypes are not preserved when a pandas data frame is *partitioned* and saved as parquet file using pyarrow but that's not the case when the data frame is not partitioned.
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)