[ 
https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192314#comment-16192314
 ] 

Wes McKinney commented on ARROW-1644:
-------------------------------------

At the moment we do not have reading and writing of mixed nesting implemented. 
We can read and write data whose nesting levels are all repeated (lists) or all 
groups (structs), but not a mix of the two (structs combined with 
lists/repeated fields). I recently wrote an important patch to help with this 
(PARQUET-1100 
https://github.com/apache/parquet-cpp/commit/4b09ac703bc75fee72f94bed8ecfe571096b04c1),
 but we could really use some help with the encoding and decoding of nested 
data. I will eventually get to it if no one else does, but that could be 
anywhere from 1 to 6 months from now given how many other projects I have 
ahead of me. 
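
To make the distinction concrete, here is a minimal sketch in plain Python 
(not pyarrow or parquet-cpp code) that classifies a column type as mixed 
nesting or not. The tuple-based schema model and the helper names 
nesting_kinds and is_mixed_nesting are hypothetical and exist only for this 
illustration, and checking each top-level column on its own is an assumption. 
The items type mirrors the "items" column in the schema quoted below.

```python
# Hypothetical schema model (an assumption for illustration only):
#   leaf   -> a type name string, e.g. "int32"
#   list   -> ("list", element_type)
#   struct -> ("struct", {field_name: field_type})


def nesting_kinds(node):
    """Return the set of nesting kinds ("list", "struct") used in one column type."""
    if isinstance(node, str):
        return set()  # a primitive leaf adds no nesting
    kind, child = node
    if kind == "list":
        return {"list"} | nesting_kinds(child)
    if kind == "struct":
        kinds = {"struct"}
        for field_type in child.values():
            kinds |= nesting_kinds(field_type)
        return kinds
    raise ValueError(f"unknown node kind: {kind!r}")


def is_mixed_nesting(node):
    """True when a column type combines lists and structs."""
    return nesting_kinds(node) == {"list", "struct"}


# The "items" column from the reported schema: array<struct<...>>
items = ("list", ("struct", {"show_title_id": "int32", "duration": "double"}))
# Only repeated levels, and only group levels
list_of_lists = ("list", ("list", "int64"))
struct_of_structs = ("struct", {"a": ("struct", {"b": "int64"})})

for name, column in [
    ("items", items),
    ("list_of_lists", list_of_lists),
    ("struct_of_structs", struct_of_structs),
]:
    print(name, "mixed" if is_mixed_nesting(column) else "not mixed")
```

Under this model, only the items column counts as mixed, which matches the 
"lists with structs are not supported" error in the report.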

> Parquet with nested structs can not be loaded in pyarrow in Oct 2017 nightly 
> build
> ----------------------------------------------------------------------------------
>
>                 Key: ARROW-1644
>                 URL: https://issues.apache.org/jira/browse/ARROW-1644
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.8.0
>            Reporter: DB Tsai
>
> We have many nested parquet files generated from Apache Spark for ranking 
> problems, and we would like to load them in python for other programs to 
> consume. 
> The schema looks like 
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  |    |-- element: struct (containsNull = false)
>  |    |    |-- show_title_id: integer (nullable = true)
>  |    |    |-- duration: double (nullable = true)
> {code}
> When I tried to load it with the pyarrow nightly build from Oct 4, 2017, I 
> got the following error.
> {code:python}
> Python 3.6.2 |Anaconda, Inc.| (default, Sep 30 2017, 18:42:57) 
> [GCC 7.2.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy as np
> >>> import pandas as pd
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> >>> table2 = pq.read_table('part-00000')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 823, in read_table
>     use_pandas_metadata=use_pandas_metadata)
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 119, in read
>     nthreads=nthreads)
>   File "_parquet.pyx", line 466, in pyarrow._parquet.ParquetReader.read_all
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: lists with structs are not supported.
> {code}
> I somehow get the impression that after 
> https://issues.apache.org/jira/browse/PARQUET-911 is merged, we should be 
> able to load the nested parquet in pyarrow. 
> Any insight about this? 
> Thanks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
