Jonas Amrich created ARROW-1459:
-----------------------------------
Summary: [Python] PyArrow fails to load partitioned parquet files
with non-primitive types
Key: ARROW-1459
URL: https://issues.apache.org/jira/browse/ARROW-1459
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.6.0
Reporter: Jonas Amrich
When reading partitioned Parquet files that contain lists (tested with files
produced by Spark), the list column of the resulting table seems to contain
data loaded from only one partition. Columns of primitive types seem to be
loaded correctly.
It can be reproduced using the following code (Arrow 0.6.0, Spark 2.1.1):
{noformat}
>>> import numpy as np
>>> import pyarrow.parquet as pq
>>> df = spark.createDataFrame(list(zip(np.arange(10).tolist(),
...                                     np.arange(20).reshape((10,2)).tolist())))
>>> df.toPandas()
   _1        _2
0   0    [0, 1]
1   1    [2, 3]
2   2    [4, 5]
3   3    [6, 7]
4   4    [8, 9]
5   5  [10, 11]
6   6  [12, 13]
7   7  [14, 15]
8   8  [16, 17]
9   9  [18, 19]
>>> df.repartition(2).write.parquet('df_parts.parquet')
>>> pq.read_table('df_parts.parquet').to_pandas()
   _1        _2
0   0    [0, 1]
1   2    [4, 5]
2   4    [8, 9]
3   6  [12, 13]
4   8  [16, 17]
5   1    [0, 1]
6   3    [4, 5]
7   5    [8, 9]
8   7  [12, 13]
9   9  [16, 17]
{noformat}
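To make the mismatch in the output above explicit, here is a minimal sketch in plain Python that compares the rows printed by pq.read_table with the rows that were written. The helper name find_mismatched_rows is hypothetical, and the row values are copied by hand from the listing above rather than read from the files.

```python
def find_mismatched_rows(expected, actual):
    """Return the sorted keys whose list value differs from the expected one."""
    expected_by_key = {key: value for key, value in expected}
    return sorted(key for key, value in actual
                  if expected_by_key.get(key) != value)


# Rows as originally created: key i is paired with [2*i, 2*i + 1].
expected = [(i, [2 * i, 2 * i + 1]) for i in range(10)]

# Rows as printed by pq.read_table(...).to_pandas() in the report above.
actual = [
    (0, [0, 1]), (2, [4, 5]), (4, [8, 9]), (6, [12, 13]), (8, [16, 17]),
    (1, [0, 1]), (3, [4, 5]), (5, [8, 9]), (7, [12, 13]), (9, [16, 17]),
]

mismatched = find_mismatched_rows(expected, actual)
print(mismatched)  # [1, 3, 5, 7, 9]
```

All keys in the second half of the output carry the list values of the first half, which matches the observation that the _1 column is read correctly while the _2 column is not.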
When the data is loaded using Spark or coalesced into one partition, everything
works as expected:
{noformat}
>>> spark.read.parquet('df_parts.parquet').toPandas()
   _1        _2
0   1    [2, 3]
1   3    [6, 7]
2   5  [10, 11]
3   7  [14, 15]
4   9  [18, 19]
5   0    [0, 1]
6   2    [4, 5]
7   4    [8, 9]
8   6  [12, 13]
9   8  [16, 17]
>>> df.coalesce(1).write.parquet('df_single.parquet')
>>> pq.read_table('df_single.parquet').to_pandas()
   _1        _2
0   0    [0, 1]
1   1    [2, 3]
2   2    [4, 5]
3   3    [6, 7]
4   4    [8, 9]
5   5  [10, 11]
6   6  [12, 13]
7   7  [14, 15]
8   8  [16, 17]
9   9  [18, 19]
{noformat}
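Since a single-file read works, one way to narrow this down might be to read each part file on its own and concatenate the results, then compare that with the directory read. The sketch below is untested against the affected version. The helper names are hypothetical, and it assumes Spark names its data files part-*.parquet; only pq.read_table and pa.concat_tables are PyArrow calls.

```python
import glob
import os


def list_part_files(dataset_dir):
    """Return the Parquet part files in a Spark output directory, sorted by name."""
    # Assumption: data files match part-*.parquet; _SUCCESS and .crc files do not.
    return sorted(glob.glob(os.path.join(dataset_dir, "part-*.parquet")))


def read_parts_individually(dataset_dir):
    """Read every part file on its own and concatenate the resulting tables."""
    # Imported lazily so the file-listing helper can be used without pyarrow.
    import pyarrow as pa
    import pyarrow.parquet as pq

    tables = [pq.read_table(path) for path in list_part_files(dataset_dir)]
    return pa.concat_tables(tables)
```

If read_parts_individually('df_parts.parquet').to_pandas() shows the correct lists, that would suggest the problem lies in how the files are combined rather than in reading a single file. This is a diagnostic idea, not a confirmed workaround.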