[
https://issues.apache.org/jira/browse/ARROW-9009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wes McKinney resolved ARROW-9009.
---------------------------------
Resolution: Fixed
Resolved in ARROW-8980
> [C++][Dataset] ARROW:schema should be removed from schema's metadata when
> reading Parquet files
> -----------------------------------------------------------------------------------------------
>
> Key: ARROW-9009
> URL: https://issues.apache.org/jira/browse/ARROW-9009
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Reporter: Joris Van den Bossche
> Assignee: Wes McKinney
> Priority: Major
> Labels: dataset
> Fix For: 1.0.0
>
>
> When reading a Parquet file (written by Arrow) with the datasets API, the
> "ARROW:schema" key is preserved in the schema metadata:
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pyarrow.dataset as ds
> table = pa.table({'a': [1, 2, 3]})
> pq.write_table(table, "test.parquet")
> dataset = ds.dataset("test.parquet", format="parquet")
> {code}
> {code}
> In [7]: dataset.schema
> Out[7]:
> a: int64
> -- field metadata --
> PARQUET:field_id: '1'
> -- schema metadata --
> ARROW:schema: '/////3gAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABAwAMAAAACAAIAAAABA' + 114
> In [8]: dataset.to_table().schema
> Out[8]:
> a: int64
> -- field metadata --
> PARQUET:field_id: '1'
> -- schema metadata --
> ARROW:schema: '/////3gAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABAwAMAAAACAAIAAAABA' + 114
> {code}
> By contrast, reading the same file with the `parquet` module reader does not
> preserve this metadata:
> {code}
> In [9]: pq.read_table("test.parquet").schema
> Out[9]:
> a: int64
> -- field metadata --
> PARQUET:field_id: '1'
> {code}
> The "ARROW:schema" entry exists only so that the Arrow schema can be
> reconstructed faithfully from the Parquet schema. Once the Arrow schema has
> been reconstructed, the entry is redundant, so there is probably no need to
> keep it in the Arrow schema's metadata.