[jira] [Created] (ARROW-1644) Parquet with nested structs can not be loaded in pyarrow in Oct 2017 nightly build

DB Tsai (JIRA) Wed, 04 Oct 2017 17:45:08 -0700

DB Tsai created ARROW-1644:
------------------------------

             Summary: Parquet with nested structs can not be loaded in pyarrow in Oct 2017 nightly build
                 Key: ARROW-1644
                 URL: https://issues.apache.org/jira/browse/ARROW-1644
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.8.0
            Reporter: DB Tsai



We have many nested Parquet files generated by Apache Spark for ranking 
problems, and we would like to load them in Python for other programs to 
consume. 

The schema looks like this:
{code:java}
root
 |-- profile_id: long (nullable = true)
 |-- country_iso_code: string (nullable = true)
 |-- items: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- show_title_id: integer (nullable = true)
 |    |    |-- duration: double (nullable = true)
{code}
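
For illustration only (not taken from the original files), one row of this 
schema could be modeled in plain Python roughly as below. The class names 
`Item` and `Row`, the helper `explode`, and all sample values are 
hypothetical; the sketch only shows the shape of the data and what a 
flattened, one-row-per-item form would look like.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Item:
    # element: struct (containsNull = false); both fields are nullable
    show_title_id: Optional[int]   # integer (nullable = true)
    duration: Optional[float]      # double (nullable = true)


@dataclass
class Row:
    profile_id: Optional[int]        # long (nullable = true)
    country_iso_code: Optional[str]  # string (nullable = true)
    items: List[Item]                # array (nullable = false) of structs


def explode(row: Row) -> List[dict]:
    """Return one flat dict per item, repeating the row-level fields."""
    return [
        {
            "profile_id": row.profile_id,
            "country_iso_code": row.country_iso_code,
            "show_title_id": item.show_title_id,
            "duration": item.duration,
        }
        for item in row.items
    ]


# Made-up sample values, for illustration only.
row = Row(profile_id=1, country_iso_code="US",
          items=[Item(10, 42.5), Item(11, None)])
flat_rows = explode(row)
```

The `items` field is the list-of-structs column; the exploded form contains 
only flat primitive fields.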

When I tried to load one of these files with the pyarrow nightly build from 
Oct 4, 2017, I got the following error:
{code:python}
Python 3.6.2 |Anaconda, Inc.| (default, Sep 30 2017, 18:42:57) 
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> import pandas as pd
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> table2 = pq.read_table('part-00000')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 823, in read_table
    use_pandas_metadata=use_pandas_metadata)
  File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 119, in read
    nthreads=nthreads)
  File "_parquet.pyx", line 466, in pyarrow._parquet.ParquetReader.read_all
  File "error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: lists with structs are not supported.
{code}

I am under the impression that once 
https://issues.apache.org/jira/browse/PARQUET-911 is merged, we should be able 
to load nested Parquet files in pyarrow. 

Any insight about this? 

Thanks.



