Joris Van den Bossche created ARROW-8802:
--------------------------------------------

             Summary: [C++][Dataset] Schema metadata are lost when reading a subset of columns
                 Key: ARROW-8802
                 URL: https://issues.apache.org/jira/browse/ARROW-8802
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
            Reporter: Joris Van den Bossche


Python example:

{code}
import pandas as pd
import pyarrow.dataset as ds

df = pd.DataFrame({'a': [1, 2, 3]})
df.to_parquet("test_metadata.parquet")

dataset = ds.dataset("test_metadata.parquet")
{code}

gives:

{code}
>>> dataset.to_table().schema
a: int64
  -- field metadata --
  PARQUET:field_id: '1'
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 397
ARROW:schema: '/////4ACAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAA' + 806

>>> dataset.to_table(columns=['a']).schema
a: int64
  -- field metadata --
  PARQUET:field_id: '1'
{code}

So when a subset of the columns is specified, the schema-level metadata entries ({{pandas}} and {{ARROW:schema}} above) are lost, while the field-level metadata is kept. The lost entries can still be informative, e.g. for conversion to pandas.