[jira] [Closed] (ARROW-1842) ParquetDataset.read(): selectively reading array column

Young-Jun Ko (JIRA) Wed, 22 Nov 2017 01:51:40 -0800

     [ https://issues.apache.org/jira/browse/ARROW-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Young-Jun Ko closed ARROW-1842.
-------------------------------
    Resolution: Duplicate

As pointed out in the comments, this issue is a duplicate of
https://issues.apache.org/jira/browse/ARROW-1684

> ParquetDataset.read(): selectively reading array column
> -------------------------------------------------------
>
>                 Key: ARROW-1842
>                 URL: https://issues.apache.org/jira/browse/ARROW-1842
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.7.1
>            Reporter: Young-Jun Ko
>
> Scenario:
> - Created a DataFrame in Spark and saved it as Parquet.
> - Columns include simple types (e.g. String) as well as an array of doubles.
> Issue:
> I can read the whole dataset using ParquetDataset in pyarrow.
> Selectively reading a column with a simple type works.
> Selectively reading the array column raises a KeyError at the following
> location:
> KeyError: 'c'
> /home/hadoop/Python/lib/python2.7/site-packages/pyarrow/_parquet.pyx in pyarrow._parquet.ParquetReader.column_name_idx (/arrow/python/build/temp.linux-x86_64-2.7/_parquet.cxx:9777)()
>     513                 self.column_idx_map[col_bytes] = i
>     514 
> --> 515         return self.column_idx_map[tobytes(column_name)]
> When I read the whole dataset instead, I get the correct metadata:
> pyarrow.Table
> a: string
> b: string
> c: list<element: double not null>
>   child 0, element: double
> d: int64
> metadata
> --------
> {'org.apache.spark.sql.parquet.row.metadata': 
> '{"type":"struct","fields":[{"name":"a","type":"string","nullable":true,"metadata":{}},{"name":"b","type":"string","nullable":true,"metadata":{}},{"name":"c","type":{"type":"array","elementType":"double","containsNull":false},"nullable":true,"metadata":{}},{"name":"d","type":"long","nullable":false,"metadata":{}}]}'}
> I may simply be missing the correct naming convention for the array column,
> but in that case the name should be reflected in the metadata.
> Thanks!
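
One plausible reading of the traceback above, offered as a sketch rather than
a confirmed diagnosis: column_idx_map appears to be keyed by exact column
name, so a lookup of 'c' fails if the array column is stored under a longer
leaf path. The pure-Python model below does not use pyarrow. The leaf path
'c.list.element' is an assumption based on the usual Parquet list layout, and
LEAF_PATHS, build_column_idx_map, exact_lookup and prefix_lookup are
hypothetical names introduced only for illustration.

```python
# Assumed leaf-column paths for the file in the report. The dotted path for
# the array column "c" is an assumption, not taken from the actual file.
LEAF_PATHS = ["a", "b", "c.list.element", "d"]


def build_column_idx_map(leaf_paths):
    """Map each leaf path to its position, similar in spirit to the
    column_idx_map seen in the traceback (hypothetical model)."""
    return {path: i for i, path in enumerate(leaf_paths)}


def exact_lookup(idx_map, column_name):
    """Exact-match lookup: raises KeyError when column_name is only the
    top-level name of a nested column."""
    return idx_map[column_name]


def prefix_lookup(leaf_paths, column_name):
    """Return the indices of every leaf that equals column_name or is
    nested beneath it (one possible way to resolve a top-level name)."""
    prefix = column_name + "."
    matches = [
        i for i, path in enumerate(leaf_paths)
        if path == column_name or path.startswith(prefix)
    ]
    if not matches:
        raise KeyError(column_name)
    return matches


if __name__ == "__main__":
    idx_map = build_column_idx_map(LEAF_PATHS)
    print(exact_lookup(idx_map, "a"))      # simple column resolves directly
    print(prefix_lookup(LEAF_PATHS, "c"))  # nested column resolves by prefix
    try:
        exact_lookup(idx_map, "c")
    except KeyError as exc:
        print("KeyError:", exc)
```

If the assumption about the leaf path holds, requesting the full leaf path
instead of 'c' might avoid the KeyError on the affected version; that has not
been verified here.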



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
