[ https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126918#comment-17126918 ]
Wes McKinney commented on ARROW-1644:
-------------------------------------
Thanks. We should follow up on the mailing list discussion and see what the
latest game plan is for implementing the Parquet nested reader. Some of my
colleagues should be able to help.
> [C++][Parquet] Read and write nested Parquet data with a mix of struct and
> list nesting levels
> ----------------------------------------------------------------------------------------------
>
> Key: ARROW-1644
> URL: https://issues.apache.org/jira/browse/ARROW-1644
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++, Python
> Affects Versions: 0.8.0
> Reporter: DB Tsai
> Assignee: Micah Kornfield
> Priority: Major
> Labels: parquet, pull-request-available
>
> We have many nested Parquet files generated from Apache Spark for ranking
> problems, and we would like to load them in Python for other programs to
> consume.
> The schema looks like this:
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  |    |-- element: struct (containsNull = false)
>  |    |    |-- show_title_id: integer (nullable = true)
>  |    |    |-- duration: double (nullable = true)
> When I tried to load one of these files with the pyarrow nightly build from
> Oct 4, 2017, I got the following error:
> {code:python}
> Python 3.6.2 |Anaconda, Inc.| (default, Sep 30 2017, 18:42:57)
> [GCC 7.2.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy as np
> >>> import pandas as pd
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> >>> table2 = pq.read_table('part-00000')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 823, in read_table
>     use_pandas_metadata=use_pandas_metadata)
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 119, in read
>     nthreads=nthreads)
>   File "_parquet.pyx", line 466, in pyarrow._parquet.ParquetReader.read_all
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: lists with structs are not supported.
> {code}
> I was under the impression that once
> https://issues.apache.org/jira/browse/PARQUET-911 is merged, we should be
> able to load nested Parquet files in pyarrow.
> Any insight into this?
> Thanks.