Truc Lam Nguyen created ARROW-11497:
---------------------------------------
Summary: [Python] pyarrow parquet writer for list does not conform
with Apache Parquet specification
Key: ARROW-11497
URL: https://issues.apache.org/jira/browse/ARROW-11497
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 3.0.0
Reporter: Truc Lam Nguyen
Attachments: parquet-tools-meta.log
Apologies if this behaviour is deliberate, but it looks like the
Parquet writer for the list data type does not conform to the Apache Parquet list
logical type specification.
According to this page:
[https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists]
a list type contains 3 levels, where the middle level, named {{list}}, must be a
repeated group with a single field named _{{element}}_.
However, in the Parquet file written by pyarrow, that single field is named
_item_ instead.
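To make the difference concrete, here is a minimal sketch of the dotted leaf-column paths that the two namings lead to for a list-of-structs column. The helper {{list_leaf_paths}} is hypothetical (it is not part of pyarrow or Parquet); it only spells out the three-level layout described above.

```python
def list_leaf_paths(column, leaf_fields, element_name="element"):
    """Build dotted leaf paths for a list-of-structs column.

    Hypothetical helper for illustration only. The layout is
    <column>.list.<element_name>.<leaf>; the specification names the
    middle level `list` and the inner field `element`.
    """
    return [f"{column}.list.{element_name}.{leaf}" for leaf in leaf_fields]


# Naming required by the specification.
spec_paths = list_leaf_paths("games", ["name", "version"])
# Naming observed in the file written by pyarrow 3.0.0.
observed_paths = list_leaf_paths("games", ["name", "version"], element_name="item")

print(spec_paths)      # ['games.list.element.name', 'games.list.element.version']
print(observed_paths)  # ['games.list.item.name', 'games.list.item.version']
```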
Below is example Python code that produces such a Parquet file (I use
pandas version 1.2.1 and pyarrow version 3.0.0):
{code:python}
import pandas as pd

# 'games' holds a list of structs, which the writer stores as a Parquet list.
df = pd.DataFrame(data=[
    {'studio': 'blizzard', 'games': [{'name': 'diablo', 'version': '3'},
                                     {'name': 'star craft', 'version': '2'}]},
    {'studio': 'ea', 'games': [{'name': 'fifa', 'version': '21'}]},
])
df.to_parquet('/tmp/test.parquet', engine='pyarrow')
{code}
Then I use parquet-tools from [https://formulae.brew.sh/formula/parquet-tools]
to check the metadata of the Parquet file with this command:
parquet-tools meta /tmp/test.parquet
The full metadata is in the attachment; here is only the excerpt for the list-type
column:
games: OPTIONAL F:1
.list: REPEATED F:1
..item: OPTIONAL F:2
...name: OPTIONAL BINARY L:STRING R:1 D:4
...version: OPTIONAL BINARY L:STRING R:1 D:4
As can be seen, the single field under {{list}} is named _item_.
I think it should be named _element_ to conform with the Apache Parquet
specification.
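The same check can be run mechanically on the excerpt above. This is only a rough sketch: it assumes that the number of leading dots in each parquet-tools line gives the nesting depth and that the first word after the colon is the repetition, which is what the excerpt suggests but which I have not verified against the tool's documentation. The function names are my own.

```python
SCHEMA_LINES = [
    "games: OPTIONAL F:1",
    ".list: REPEATED F:1",
    "..item: OPTIONAL F:2",
    "...name: OPTIONAL BINARY L:STRING R:1 D:4",
    "...version: OPTIONAL BINARY L:STRING R:1 D:4",
]


def parse_line(line):
    """Return (depth, field_name, repetition) for one schema line.

    Assumption: leading dots encode the nesting depth.
    """
    stripped = line.lstrip(".")
    depth = len(line) - len(stripped)
    name, _, rest = stripped.partition(":")
    repetition = rest.split()[0]
    return depth, name, repetition


def nonconforming_list_fields(lines):
    """Names of fields directly under a repeated `list` group that are not `element`."""
    stack = []      # ancestors of the current line as (depth, name, repetition)
    offenders = []
    for line in lines:
        depth, name, repetition = parse_line(line)
        # Drop entries that are not ancestors of this line.
        while stack and stack[-1][0] >= depth:
            stack.pop()
        if stack:
            _, parent_name, parent_rep = stack[-1]
            if parent_name == "list" and parent_rep == "REPEATED" and name != "element":
                offenders.append(name)
        stack.append((depth, name, repetition))
    return offenders


print(nonconforming_list_fields(SCHEMA_LINES))  # ['item']
```

A struct field that happens to be called {{list}} and is itself repeated would be flagged as well, so treat the result as a hint rather than a full conformance check.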
--
This message was sent by Atlassian Jira
(v8.3.4#803005)