[ https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16784864#comment-16784864 ]
David Lee commented on ARROW-1644:
----------------------------------
I've been able to write Parquet columns that are lists, but I haven't been
able to write a column that is a list of structs.
This works:
{code:python}
schema = pa.schema([
    pa.field('test_id', pa.string()),
    pa.field('a', pa.list_(pa.string())),
    pa.field('b', pa.list_(pa.int32()))
])
{code}
This structure isn't supported yet:
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([
    pa.field('test_id', pa.string()),
    pa.field('testlist', pa.list_(pa.struct([('a', pa.string()),
                                             ('b', pa.int32())])))
])

new_records = [
    {'test_id': '123', 'testlist': [{'a': 'xyz', 'b': 22}]},
    {'test_id': '789', 'testlist': [{'a': 'aaa', 'b': 33}]},
]

# Build one Arrow array per schema field.
arrow_columns = []
for column in schema.names:
    arrow_columns.append(
        pa.array([v[column] for v in new_records],
                 type=schema.types[schema.get_field_index(column)]))

arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)

# Inspecting the table and its columns works.
arrow_table
arrow_table[0]
arrow_table[1]
arrow_table[1][0]
arrow_table[1][1]
>>> pq.write_table(arrow_table, "test.parquet")
Traceback (most recent call last):
packages/pyarrow/parquet.py", line 1160, in write_table
    writer.write_table(table, row_group_size=row_group_size)
  File "/proj/pag/python/current/lib/python3.6/site-packages/pyarrow/parquet.py", line 405, in write_table
    self.writer.write_table(table, row_group_size=row_group_size)
  File "pyarrow/_parquet.pyx", line 924, in pyarrow._parquet.ParquetWriter.write_table
  File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Nested column branch had multiple children
{code}
Struct support is the missing piece for saving structured JSON as columnar
Parquet, which would make the JSON searchable.
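Until then, one possible stopgap (my assumption, not something proposed in this issue) is to split the list of structs into one list column per struct field, which yields the list-only layout that already works. The helper name `split_struct_list` is hypothetical. Note that the struct grouping is no longer expressed in the schema; elements only line up by position.

```python
def split_struct_list(record, list_key, field_names):
    """Turn a list-of-dicts value into one parallel list per struct field."""
    # Copy every key except the nested list-of-structs column.
    row = {key: value for key, value in record.items() if key != list_key}
    # One list per struct field; a field missing from a struct becomes None.
    for name in field_names:
        row[name] = [item.get(name) for item in record[list_key]]
    return row


new_records = [
    {'test_id': '123', 'testlist': [{'a': 'xyz', 'b': 22}]},
    {'test_id': '789', 'testlist': [{'a': 'aaa', 'b': 33}]},
]

flat_records = [
    split_struct_list(record, 'testlist', ['a', 'b'])
    for record in new_records
]
```

The resulting rows have the keys `test_id`, `a`, and `b`, so they fit the first schema above (`a` as a list of strings, `b` as a list of int32).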
> [Python] Read and write nested Parquet data with a mix of struct and list
> nesting levels
> ----------------------------------------------------------------------------------------
>
> Key: ARROW-1644
> URL: https://issues.apache.org/jira/browse/ARROW-1644
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Python
> Affects Versions: 0.8.0
> Reporter: DB Tsai
> Assignee: Joshua Storck
> Priority: Major
> Labels: parquet, pull-request-available
> Fix For: 0.14.0
>
>
> We have many nested parquet files generated from Apache Spark for ranking
> problems, and we would like to load them in python for other programs to
> consume.
> The schema looks like
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  |    |-- element: struct (containsNull = false)
>  |    |    |-- show_title_id: integer (nullable = true)
>  |    |    |-- duration: double (nullable = true)
> {code}
> When I tried to load it with the pyarrow nightly build from Oct 4, 2017, I
> got the following error.
> {code:python}
> Python 3.6.2 |Anaconda, Inc.| (default, Sep 30 2017, 18:42:57)
> [GCC 7.2.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy as np
> >>> import pandas as pd
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> >>> table2 = pq.read_table('part-00000')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 823, in read_table
>     use_pandas_metadata=use_pandas_metadata)
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 119, in read
>     nthreads=nthreads)
>   File "_parquet.pyx", line 466, in pyarrow._parquet.ParquetReader.read_all
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: lists with structs are not supported.
> {code}
> I was under the impression that after
> https://issues.apache.org/jira/browse/PARQUET-911 is merged, we should be
> able to load nested Parquet files in pyarrow.
> Any insight on this?
> Thanks.