[
https://issues.apache.org/jira/browse/ARROW-9363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joris Van den Bossche reassigned ARROW-9363:
--------------------------------------------
Assignee: Joris Van den Bossche
> [C++][Dataset] ParquetDatasetFactory schema: pandas metadata is lost
> --------------------------------------------------------------------
>
> Key: ARROW-9363
> URL: https://issues.apache.org/jira/browse/ARROW-9363
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Reporter: Joris Van den Bossche
> Assignee: Joris Van den Bossche
> Priority: Major
> Labels: dataset, dataset-dask-integration
> Fix For: 1.0.0
>
>
> When using the standard factory, the pandas metadata is included in the
> schema metadata of the dataset, but when using the ParquetDatasetFactory, it
> is not included.
>
> Using dask to write a small partitioned dataset, including a {{_metadata}}
> file:
> {code:python}
> import pandas as pd
> import dask.dataframe as dd
>
> df = pd.DataFrame({"part": ["A", "A", "B", "B"], "col": [1, 2, 3, 4]})
> ddf = dd.from_pandas(df, npartitions=2)
> ddf.to_parquet("test_parquet_pandas_metadata/", engine="pyarrow")
> {code}
> {code:python}
> In [9]: import pyarrow.dataset as ds
>
>
> # with ds.dataset -> pandas metadata included
> In [11]: ds.dataset("test_parquet_pandas_metadata/", format="parquet",
> partitioning="hive").schema
>
> Out[11]:
> part: string
> -- field metadata --
> PARQUET:field_id: '1'
> col: int64
> -- field metadata --
> PARQUET:field_id: '2'
> index: int64
> -- field metadata --
> PARQUET:field_id: '3'
> -- schema metadata --
> pandas: '{"index_columns": ["index"], "column_indexes": [{"name": null, "' + 558
> # with parquet_dataset -> pandas metadata not included
> In [14]: ds.parquet_dataset("test_parquet_pandas_metadata/_metadata",
> partitioning="hive").schema
>
> Out[14]:
> part: string
> -- field metadata --
> PARQUET:field_id: '1'
> col: int64
> -- field metadata --
> PARQUET:field_id: '2'
> index: int64
> -- field metadata --
> PARQUET:field_id: '3'
> # to show that the pandas metadata is present in the actual Parquet
> # FileMetadata (assumes `import pyarrow.parquet as pq`)
> In [17]: pq.read_metadata("test_parquet_pandas_metadata/_metadata").metadata
>
>
> Out[17]:
> {b'ARROW:schema': b'/////4ADAAAQAAAAAAAKAA4AB...',
> b'pandas': b'{"index_columns": ["index"], ...'}
> {code}
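
To illustrate why the missing schema metadata matters downstream, here is a minimal sketch of how a consumer might look up the pandas index columns from schema-level metadata. It assumes the metadata has the shape of pyarrow's `Schema.metadata` (a dict with bytes keys and bytes values, or None when nothing is set). The helper name `pandas_index_columns` and the shortened sample metadata are hypothetical and only serve the example; they are not taken from the issue.

```python
import json
from typing import Optional


def pandas_index_columns(schema_metadata: Optional[dict]) -> Optional[list]:
    """Return the pandas index columns recorded in schema metadata, or None.

    `schema_metadata` is expected to look like pyarrow's `Schema.metadata`:
    a dict with bytes keys and bytes values, or None when no metadata is set.
    """
    if not schema_metadata or b"pandas" not in schema_metadata:
        # No pandas metadata: the consumer cannot tell which column was the index.
        return None
    pandas_meta = json.loads(schema_metadata[b"pandas"].decode("utf-8"))
    return pandas_meta.get("index_columns")


# Hypothetical, shortened metadata for illustration only; the real value
# written by pandas/pyarrow carries more keys than shown here.
with_pandas = {b"pandas": b'{"index_columns": ["index"], "columns": []}'}
without_pandas = None  # a schema that carries no schema-level metadata

print(pandas_index_columns(with_pandas))
print(pandas_index_columns(without_pandas))
```

With the schema from `ds.dataset(...)` such a lookup can find the `index` column, while with the schema from `ds.parquet_dataset(...)` as shown above there is no `pandas` key to read, so the index information is unavailable to the caller.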
--
This message was sent by Atlassian Jira
(v8.3.4#803005)