[ 
https://issues.apache.org/jira/browse/ARROW-8802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-8802:
-----------------------------------------
    Labels: dataset dataset-dask-integration  (was: dataset)

> [C++][Dataset] Schema metadata are lost when reading a subset of columns
> ------------------------------------------------------------------------
>
>                 Key: ARROW-8802
>                 URL: https://issues.apache.org/jira/browse/ARROW-8802
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset, dataset-dask-integration
>
> Python example:
> {code}
> import pandas as pd
> import pyarrow.dataset as ds
>
> df = pd.DataFrame({'a': [1, 2, 3]})
> df.to_parquet("test_metadata.parquet")
> dataset = ds.dataset("test_metadata.parquet")
> {code}
> gives:
> {code}
> >>> dataset.to_table().schema 
> a: int64
>   -- field metadata --
>   PARQUET:field_id: '1'
> -- schema metadata --
> pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 397
> ARROW:schema: '/////4ACAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAA' + 806
> >>> dataset.to_table(columns=['a']).schema 
> a: int64
>   -- field metadata --
>   PARQUET:field_id: '1'
> {code}
> So when only a subset of the columns is requested, the schema-level
> metadata entries are lost, even though they can still be useful
> (e.g. for conversion to pandas).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
