[
https://issues.apache.org/jira/browse/SPARK-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Guoqiang Li resolved SPARK-3364.
--------------------------------
Resolution: Fixed
> Zip equal-length but unequally-partition
> ----------------------------------------
>
> Key: SPARK-3364
> URL: https://issues.apache.org/jira/browse/SPARK-3364
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.0.2
> Reporter: Kevin Jung
> Fix For: 1.1.0
>
>
> ZippedRDD losts some elements after zipping RDDs with equal numbers of
> partitions but unequal numbers of elements in their each partitions.
> This can happen when a user creates RDD by sc.textFile(path,partitionNumbers)
> with physically unbalanced HDFS file.
> {noformat}
> var x = sc.parallelize(1 to 9,3)
> var y = sc.parallelize(Array(1,1,1,1,1,2,2,3,3),3).keyBy(i=>i)
> var z = y.partitionBy(new RangePartitioner(3,y))
> expected
> x.zip(y).count()
> 9
> x.zip(y).collect()
> Array[(Int, (Int, Int))] = Array((1,(1,1)), (2,(1,1)), (3,(1,1)), (4,(1,1)),
> (5,(1,1)), (6,(2,2)), (7,(2,2)), (8,(3,3)), (9,(3,3)))
> unexpected
> x.zip(z).count()
> 7
> x.zip(z).collect()
> Array[(Int, (Int, Int))] = Array((1,(1,1)), (2,(1,1)), (3,(1,1)), (4,(2,2)),
> (5,(2,2)), (7,(3,3)), (8,(3,3)))
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]