[jira] [Commented] (ARROW-1644) [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels

Micah Kornfield (Jira) Thu, 07 Nov 2019 17:30:26 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969712#comment-16969712
 ]


Micah Kornfield commented on ARROW-1644:
----------------------------------------

The code isn't really usable as-is: it is based on the old repo, a lot of 
changes have been made since, and it had a performance regression.  I haven't 
had time to work on this, but I still hope to get some bandwidth in the next 
month or so.  If there are motivated parties, I'm happy to remove my name from 
the assignment.

> [C++][Parquet] Read and write nested Parquet data with a mix of struct and 
> list nesting levels
> ----------------------------------------------------------------------------------------------
>
>                 Key: ARROW-1644
>                 URL: https://issues.apache.org/jira/browse/ARROW-1644
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++, Python
>    Affects Versions: 0.8.0
>            Reporter: DB Tsai
>            Assignee: Micah Kornfield
>            Priority: Major
>              Labels: parquet, pull-request-available
>             Fix For: 1.0.0
>
>
> We have many nested Parquet files generated by Apache Spark for ranking 
> problems, and we would like to load them in Python for other programs to 
> consume. 
> The schema looks like this:
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  |    |-- element: struct (containsNull = false)
>  |    |    |-- show_title_id: integer (nullable = true)
>  |    |    |-- duration: double (nullable = true)
> {code}
> When I tried to load one of them with a nightly build of pyarrow from 
> Oct 4, 2017, I got the following error:
> {code:python}
> Python 3.6.2 |Anaconda, Inc.| (default, Sep 30 2017, 18:42:57) 
> [GCC 7.2.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy as np
> >>> import pandas as pd
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> >>> table2 = pq.read_table('part-00000')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 823, in read_table
>     use_pandas_metadata=use_pandas_metadata)
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 119, in read
>     nthreads=nthreads)
>   File "_parquet.pyx", line 466, in pyarrow._parquet.ParquetReader.read_all
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: lists with structs are not supported.
> {code}
> I was under the impression that once 
> https://issues.apache.org/jira/browse/PARQUET-911 is merged, we should be 
> able to load nested Parquet files in pyarrow. 
> Any insight into this? 
> Thanks.
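
For readers unfamiliar with why a list of structs is harder to read than a flat column, the following is a rough, standard-library-only sketch of how the maximum definition and repetition levels of each leaf column follow from the schema quoted above. It is not Arrow or parquet-cpp code. The `Node` class, the `leaf_levels` helper, and the assumed three-level list layout (`items` / `list` / `element`) are all illustrative assumptions; the actual layout of the Spark-written files is not shown in this issue.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for one field of a Parquet schema."""
    name: str
    repetition: str  # "required", "optional", or "repeated"
    children: List["Node"] = field(default_factory=list)


def leaf_levels(node: Node, path: Tuple[str, ...] = (),
                max_def: int = 0, max_rep: int = 0) -> Dict[str, Tuple[int, int]]:
    """Return {leaf path: (max definition level, max repetition level)}.

    Every optional or repeated field on the path adds one definition
    level; every repeated field adds one repetition level.
    """
    if node.repetition in ("optional", "repeated"):
        max_def += 1
    if node.repetition == "repeated":
        max_rep += 1
    path = path + (node.name,)
    if not node.children:
        return {".".join(path): (max_def, max_rep)}
    result: Dict[str, Tuple[int, int]] = {}
    for child in node.children:
        result.update(leaf_levels(child, path, max_def, max_rep))
    return result


# Assumed Parquet layout for the Spark schema in the issue description:
# a non-nullable array of non-null structs with nullable members.
fields = [
    Node("profile_id", "optional"),
    Node("country_iso_code", "optional"),
    Node("items", "required", [
        Node("list", "repeated", [
            Node("element", "required", [
                Node("show_title_id", "optional"),
                Node("duration", "optional"),
            ]),
        ]),
    ]),
]

levels: Dict[str, Tuple[int, int]] = {}
for top_level in fields:
    levels.update(leaf_levels(top_level))

for column, (definition, repetition) in levels.items():
    print(f"{column}: max definition level={definition}, "
          f"max repetition level={repetition}")
```

Under these assumptions the two struct members are stored as separate leaf columns that share one repeated ancestor, so a reader has to reassemble the list offsets and the struct from both columns' levels together. That reassembly is the step the "lists with structs are not supported" error refers to.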



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
