Joris Van den Bossche created ARROW-8802:
--------------------------------------------
Summary: [C++][Dataset] Schema metadata are lost when reading a subset of columns
Key: ARROW-8802
URL: https://issues.apache.org/jira/browse/ARROW-8802
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Joris Van den Bossche
Python example:
{code}
import pandas as pd
import pyarrow.dataset as ds
df = pd.DataFrame({'a': [1, 2, 3]})
df.to_parquet("test_metadata.parquet")
dataset = ds.dataset("test_metadata.parquet")
{code}
gives:
{code}
>>> dataset.to_table().schema
a: int64
-- field metadata --
PARQUET:field_id: '1'
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 397
ARROW:schema: '/////4ACAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAA' + 806
>>> dataset.to_table(columns=['a']).schema
a: int64
-- field metadata --
PARQUET:field_id: '1'
{code}
So when only a subset of the columns is requested, the schema-level metadata
entries are dropped and only the field metadata survives, even though those
entries can still be informative (e.g. for conversion to pandas).