[ https://issues.apache.org/jira/browse/ARROW-1459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Uwe L. Korn resolved ARROW-1459.
--------------------------------
Resolution: Fixed
Issue resolved by pull request 1090
[https://github.com/apache/arrow/pull/1090]
> [Python] PyArrow fails to load partitioned parquet files with non-primitive
> types
> ---------------------------------------------------------------------------------
>
> Key: ARROW-1459
> URL: https://issues.apache.org/jira/browse/ARROW-1459
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.6.0
> Reporter: Jonas Amrich
> Assignee: Wes McKinney
> Fix For: 0.7.0
>
>
> When reading partitioned Parquet files that contain lists (tested with files
> produced by Spark), the list column of the resulting table appears to contain
> data loaded from only one partition. Primitive types seem to be loaded correctly.
> The issue can be reproduced with the following code (Arrow 0.6.0, Spark 2.1.1):
> {noformat}
> >>> import numpy as np
> >>> import pyarrow.parquet as pq
> >>> df = spark.createDataFrame(list(zip(np.arange(10).tolist(),
> ...     np.arange(20).reshape((10,2)).tolist())))
> >>> df.toPandas()
> _1 _2
> 0 0 [0, 1]
> 1 1 [2, 3]
> 2 2 [4, 5]
> 3 3 [6, 7]
> 4 4 [8, 9]
> 5 5 [10, 11]
> 6 6 [12, 13]
> 7 7 [14, 15]
> 8 8 [16, 17]
> 9 9 [18, 19]
> >>> df.repartition(2).write.parquet('df_parts.parquet')
> >>> pq.read_table('df_parts.parquet').to_pandas()
> _1 _2
> 0 0 [0, 1]
> 1 2 [4, 5]
> 2 4 [8, 9]
> 3 6 [12, 13]
> 4 8 [16, 17]
> 5 1 [0, 1]
> 6 3 [4, 5]
> 7 5 [8, 9]
> 8 7 [12, 13]
> 9 9 [16, 17]
> {noformat}
> When the data is loaded using Spark or coalesced into one partition,
> everything works as expected:
> {noformat}
> >>> spark.read.parquet('df_parts.parquet').toPandas()
> _1 _2
> 0 1 [2, 3]
> 1 3 [6, 7]
> 2 5 [10, 11]
> 3 7 [14, 15]
> 4 9 [18, 19]
> 5 0 [0, 1]
> 6 2 [4, 5]
> 7 4 [8, 9]
> 8 6 [12, 13]
> 9 8 [16, 17]
> >>> df.coalesce(1).write.parquet('df_single.parquet')
> >>> pq.read_table('df_single.parquet').to_pandas()
> _1 _2
> 0 0 [0, 1]
> 1 1 [2, 3]
> 2 2 [4, 5]
> 3 3 [6, 7]
> 4 4 [8, 9]
> 5 5 [10, 11]
> 6 6 [12, 13]
> 7 7 [14, 15]
> 8 8 [16, 17]
> 9 9 [18, 19]
> {noformat}
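
The symptom in the partitioned read can be checked mechanically. In the reproduction data, row i always carries the list [2*i, 2*i+1], so any row that breaks this invariant was filled with list values belonging to another row. The following is a minimal sketch, not part of the original report: the helper name find_mismatched_rows is hypothetical, and the rows are copied by hand from the pq.read_table('df_parts.parquet') output above rather than read from disk.

```python
def find_mismatched_rows(rows):
    """Return the (key, values) pairs whose list is not [2*key, 2*key + 1].

    Hypothetical helper: the invariant only holds for the reproduction
    data in this report, where _2 is built from np.arange(20).reshape((10, 2)).
    """
    return [(key, values) for key, values in rows
            if values != [2 * key, 2 * key + 1]]


# Rows (_1, _2) transcribed from the partitioned read shown in the report.
partitioned_rows = [
    (0, [0, 1]), (2, [4, 5]), (4, [8, 9]), (6, [12, 13]), (8, [16, 17]),
    (1, [0, 1]), (3, [4, 5]), (5, [8, 9]), (7, [12, 13]), (9, [16, 17]),
]

bad = find_mismatched_rows(partitioned_rows)
# Keys whose list column does not belong to them.
print([key for key, _ in bad])
```

For the transcribed output, the mismatched keys are exactly the odd ones: the second half of the table reuses the list values of the first half, while the primitive column _1 is correct.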