[ https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Micah Kornfield resolved ARROW-11497.
-------------------------------------
Fix Version/s: 4.0.0
Resolution: Fixed
Issue resolved by pull request 9489
[https://github.com/apache/arrow/pull/9489]
> [Python] pyarrow parquet writer for list does not conform with Apache Parquet
> specification
> -------------------------------------------------------------------------------------------
>
> Key: ARROW-11497
> URL: https://issues.apache.org/jira/browse/ARROW-11497
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 3.0.0
> Reporter: Truc Lam Nguyen
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: parquet-tools-meta.log
>
> Time Spent: 2.5h
> Remaining Estimate: 0h
>
> Apologies if this behavior is deliberate, but the pyarrow Parquet writer for
> the list data type does not appear to conform to the Apache Parquet list
> logical type specification.
> According to
> [https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists],
> a list type has three levels, where the middle level, named {{list}}, must be
> a repeated group with a single field named _{{element}}_.
> However, in the Parquet file produced by the pyarrow writer, that single field
> is named _item_ instead.
> Below is example Python code that produces such a Parquet file (using
> pandas 1.2.1 and pyarrow 3.0.0):
> {code:python}
> import pandas as pd
>
> # Each row holds a list of structs in the 'games' column.
> df = pd.DataFrame(data=[
>     {'studio': 'blizzard',
>      'games': [{'name': 'diablo', 'version': '3'},
>                {'name': 'star craft', 'version': '2'}]},
>     {'studio': 'ea',
>      'games': [{'name': 'fifa', 'version': '21'}]},
> ])
> df.to_parquet('/tmp/test.parquet', engine='pyarrow')
> {code}
> I then used parquet-tools from
> [https://formulae.brew.sh/formula/parquet-tools] to check the metadata of the
> Parquet file with this command:
> parquet-tools meta /tmp/test.parquet
> The full metadata is in the attached file; here is only the extract for the
> list-type column:
> games: OPTIONAL F:1
> .list: REPEATED F:1
> ..item: OPTIONAL F:2
> ...name: OPTIONAL BINARY L:STRING R:1 D:4
> ...version: OPTIONAL BINARY L:STRING R:1 D:4
> As can be seen, the single field under {{list}} is named _item_.
> I think it should be named _element_ to conform with the Apache Parquet
> specification.
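
The check described in the report can be automated against the dotted listing that `parquet-tools meta` prints. The sketch below is a simplified heuristic, not part of the issue or of parquet-tools: it assumes the line format shown in the extract above (leading dots give the nesting depth, the name precedes the first colon, the repetition follows it), and it treats any `REPEATED` group named `list` as the middle level of a list. It does not cover the backward-compatibility rules in the specification. The names `parse_meta_line`, `find_nonconforming_lists`, and `SAMPLE_META` are hypothetical.

```python
# Extract copied from the issue report.
SAMPLE_META = [
    "games: OPTIONAL F:1",
    ".list: REPEATED F:1",
    "..item: OPTIONAL F:2",
    "...name: OPTIONAL BINARY L:STRING R:1 D:4",
    "...version: OPTIONAL BINARY L:STRING R:1 D:4",
]


def parse_meta_line(line):
    """Split one dotted listing line into (depth, name, repetition)."""
    stripped = line.strip()
    body = stripped.lstrip(".")
    depth = len(stripped) - len(body)  # number of leading dots
    name, _, rest = body.partition(":")
    tokens = rest.split()
    repetition = tokens[0] if tokens else ""
    return depth, name.strip(), repetition


def find_nonconforming_lists(meta_lines):
    """Return dotted paths of fields under a repeated `list` group not named `element`."""
    violations = []
    stack = []  # (name, repetition) of the ancestors of the current line
    for line in meta_lines:
        if not line.strip():
            continue
        depth, name, repetition = parse_meta_line(line)
        del stack[depth:]  # drop entries that are not ancestors of this line
        parent = stack[-1] if stack else None
        if parent == ("list", "REPEATED") and name != "element":
            violations.append(".".join([n for n, _ in stack] + [name]))
        stack.append((name, repetition))
    return violations


print(find_nonconforming_lists(SAMPLE_META))
```

Run against the extract from the report, the function flags the `games.list.item` path; a listing whose inner field is named `element` yields an empty result.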