[
https://issues.apache.org/jira/browse/ARROW-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Young-Jun Ko closed ARROW-1842.
-------------------------------
Resolution: Duplicate
As pointed out in the comments, this is a duplicate of
https://issues.apache.org/jira/browse/ARROW-1684
> ParquetDataset.read(): selectively reading array column
> -------------------------------------------------------
>
> Key: ARROW-1842
> URL: https://issues.apache.org/jira/browse/ARROW-1842
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.7.1
> Reporter: Young-Jun Ko
>
> Scenario:
> - created a DataFrame in Spark and saved it as Parquet
> - columns include simple types (e.g. String) and also an array of doubles
> Issue:
> I can read the whole dataset using ParquetDataset in pyarrow.
> Selectively reading a simple-type column => works
> Selectively reading the array column => KeyError at the following
> place:
> KeyError: 'c'
> /home/hadoop/Python/lib/python2.7/site-packages/pyarrow/_parquet.pyx in
> pyarrow._parquet.ParquetReader.column_name_idx
> (/arrow/python/build/temp.linux-x86_64-2.7/_parquet.cxx:9777)()
> 513 self.column_idx_map[col_bytes] = i
> 514
> --> 515 return self.column_idx_map[tobytes(column_name)]
> When I just read the whole dataset, I get the correct metadata:
> pyarrow.Table
> a: string
> b: string
> c: list<element: double not null>
> child 0, element: double
> d: int64
> metadata
> --------
> {'org.apache.spark.sql.parquet.row.metadata':
> '{"type":"struct","fields":[{"name":"a","type":"string","nullable":true,"metadata":{}},{"name":"b","type":"string","nullable":true,"metadata":{}},{"name":"c","type":{"type":"array","elementType":"double","containsNull":false},"nullable":true,"metadata":{}},{"name":"d","type":"long","nullable":false,"metadata":{}}]}'}
> I might just be missing the correct naming convention of the array column.
> But then this name should be reflected in the metadata.
> Thanks!
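
One plausible reading of the traceback above, sketched in plain Python rather
than taken from pyarrow's source: the reader keeps a map from full leaf-column
paths to indices, so a bare top-level name like 'c' is not a key when the
column is nested. The leaf path "c.list.element" is an assumption based on
Parquet's standard three-level list layout and the "element" child shown in
the schema; resolve_columns is a hypothetical helper, and the sketch uses str
keys where the real map uses bytes.

```python
# Illustrative model only; this is not pyarrow's actual implementation.
# Assumption: each primitive field is one leaf column, and the list column
# "c" has a single leaf addressed by the dotted path "c.list.element".
LEAF_PATHS = ["a", "b", "c.list.element", "d"]


def build_column_idx_map(leaf_paths):
    """Map each full leaf path to its column index (exact-match keys only)."""
    return {path: i for i, path in enumerate(leaf_paths)}


def resolve_columns(leaf_paths, name):
    """Hypothetical helper: indices of every leaf at or under `name`."""
    prefix = name + "."
    return [
        i
        for i, path in enumerate(leaf_paths)
        if path == name or path.startswith(prefix)
    ]


column_idx_map = build_column_idx_map(LEAF_PATHS)

# An exact lookup by the top-level name fails, mirroring the reported error.
try:
    column_idx_map["c"]
    exact_lookup_failed = False
except KeyError:
    exact_lookup_failed = True

# Prefix-based resolution finds the nested leaf for the same name.
nested_indices = resolve_columns(LEAF_PATHS, "c")
```

If that reading is right, requesting the full leaf path instead of 'c' might
serve as a workaround on 0.7.1, though this has not been verified here;
ARROW-1684 tracks the underlying issue.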