[ https://issues.apache.org/jira/browse/ARROW-8802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche updated ARROW-8802:
-----------------------------------------
    Labels: dataset dataset-dask-integration  (was: dataset)

> [C++][Dataset] Schema metadata are lost when reading a subset of columns
> ------------------------------------------------------------------------
>
>                 Key: ARROW-8802
>                 URL: https://issues.apache.org/jira/browse/ARROW-8802
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset, dataset-dask-integration
>
> Python example:
> {code}
> import pandas as pd
> import pyarrow.dataset as ds
>
> df = pd.DataFrame({'a': [1, 2, 3]})
> df.to_parquet("test_metadata.parquet")
> dataset = ds.dataset("test_metadata.parquet")
> {code}
> gives:
> {code}
> >>> dataset.to_table().schema
> a: int64
> -- field metadata --
> PARQUET:field_id: '1'
> -- schema metadata --
> pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 397
> ARROW:schema: '/////4ACAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAA' + 806
> >>> dataset.to_table(columns=['a']).schema
> a: int64
> -- field metadata --
> PARQUET:field_id: '1'
> {code}
> So when specifying a subset of the columns, the additional metadata entries are lost (while those can still be informative, eg for conversion to pandas)

--
This message was sent by Atlassian Jira
(v8.3.4#803005)