Ged Steponavicius created ARROW-8251:
----------------------------------------

             Summary: [Python] pandas.ExtensionDtype does not survive round trip with write_to_dataset
                 Key: ARROW-8251
                 URL: https://issues.apache.org/jira/browse/ARROW-8251
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.16.0
         Environment: pandas 1.0.1
parquet 0.16
            Reporter: Ged Steponavicius


When a DataFrame has columns that use a pandas.ExtensionDtype (nullable int or string), write_to_dataset produces a parquet dataset which, when read back in, has different dtypes than the original df.
{code:java}
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

parquet_dataset = 'partquet_dataset/'
parquet_file = 'test.parquet'

df = pd.DataFrame([{'str_col':'abc','int_col':1,'part':1},
                   {'str_col':np.nan,'int_col':np.nan,'part':1}])

# Convert to the nullable extension dtypes
df['str_col'] = df['str_col'].astype(pd.StringDtype())
df['int_col'] = df['int_col'].astype(pd.Int64Dtype())

table = pa.Table.from_pandas(df)

# Write the same table as a partitioned dataset and as a single file
pq.write_to_dataset(table, root_path=parquet_dataset, partition_cols=['part'])
pq.write_table(table, where=parquet_file)
{code}
write_table handles the schema correctly, and the pandas.ExtensionDtype columns survive the round trip:
{code:java}
pq.read_table(parquet_file).to_pandas().dtypes

str_col    string
int_col     Int64
part        int64
{code}
However, with write_to_dataset the columns revert to object/float64:
{code:java}
pq.read_table(parquet_dataset).to_pandas().dtypes

str_col      object
int_col     float64
part       category
{code}
I have also tried writing common metadata at the top-level directory of the partitioned dataset and then passing that metadata to read_table, but the results are the same as without metadata:
{code:java}
pq.write_metadata(table.schema, parquet_dataset+'_common_metadata', version='2.0')
meta = pq.read_metadata(parquet_dataset+'_common_metadata')
pq.read_table(parquet_dataset, metadata=meta).to_pandas().dtypes
{code}
This also affects pandas to_parquet when partition_cols is specified:
{code:java}
df.to_parquet(path=parquet_dataset, partition_cols=['part'])
pd.read_parquet(parquet_dataset).dtypes

str_col      object
int_col     float64
part       category
{code}