Ged Steponavicius created ARROW-8251:
----------------------------------------

             Summary: [Python] pandas.ExtensionDtype does not survive round trip with write_to_dataset
                 Key: ARROW-8251
                 URL: https://issues.apache.org/jira/browse/ARROW-8251
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.16.0
         Environment: pandas 1.0.1
parquet 0.16
            Reporter: Ged Steponavicius


write_to_dataset, given pandas columns using a pandas.ExtensionDtype (nullable Int64 or string), produces a Parquet dataset that, when read back, has different dtypes than the original DataFrame:
{code:python}
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

parquet_dataset = 'parquet_dataset/'
parquet_file = 'test.parquet'

df = pd.DataFrame([{'str_col': 'abc', 'int_col': 1, 'part': 1},
                   {'str_col': np.nan, 'int_col': np.nan, 'part': 1}])
df['str_col'] = df['str_col'].astype(pd.StringDtype())
df['int_col'] = df['int_col'].astype(pd.Int64Dtype())

table = pa.Table.from_pandas(df)

pq.write_to_dataset(table, root_path=parquet_dataset, partition_cols=['part'])
pq.write_table(table, where=parquet_file) {code}
write_table handles the schema correctly, and the pandas.ExtensionDtype columns survive the round trip:
{code:python}
pq.read_table(parquet_file).to_pandas().dtypes 
str_col string 
int_col Int64 
part int64 {code}
However, write_to_dataset reverts them to object/float64:
{code:python}
pq.read_table(parquet_dataset).to_pandas().dtypes 
str_col object 
int_col float64 
part category {code}
I have also tried writing common metadata at the top-level directory of the partitioned dataset and then passing that metadata to read_table, but the results are the same as without metadata:
{code:python}
pq.write_metadata(table.schema, parquet_dataset + '_common_metadata',
                  version='2.0')
meta = pq.read_metadata(parquet_dataset + '_common_metadata')
pq.read_table(parquet_dataset, metadata=meta).to_pandas().dtypes {code}
This also affects pandas.DataFrame.to_parquet when partition_cols is specified:
{code:python}
df.to_parquet(path=parquet_dataset, partition_cols=['part'])
pd.read_parquet(parquet_dataset).dtypes 
str_col object 
int_col float64 
part category {code}
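As a stopgap (not from the original report), the extension dtypes can be restored manually after reading the dataset back, since only the pandas metadata is lost, not the values. A minimal sketch, using a DataFrame that mimics the object/float64 dtypes read back from the partitioned dataset:

```python
import numpy as np
import pandas as pd

# Mimic the dtypes that come back from reading the partitioned dataset:
# str_col is object, int_col is float64 (with NaN standing in for nulls).
read_back = pd.DataFrame({'str_col': ['abc', np.nan],
                          'int_col': [1.0, np.nan],
                          'part': [1, 1]})

# Cast back to the nullable extension dtypes; NaN becomes pd.NA.
restored = read_back.astype({'str_col': 'string', 'int_col': 'Int64'})
print(restored.dtypes)
```

This only works if the caller still knows the intended dtypes (e.g. from the original schema); it does not fix the metadata loss itself.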
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
