[ https://issues.apache.org/jira/browse/SPARK-17817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Felix Cheung resolved SPARK-17817.
----------------------------------
    Resolution: Fixed
      Assignee: Liang-Chi Hsieh
 Fix Version/s: 2.1.0

> PySpark RDD Repartitioning Results in Highly Skewed Partition Sizes
> -------------------------------------------------------------------
>
>                 Key: SPARK-17817
>                 URL: https://issues.apache.org/jira/browse/SPARK-17817
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1, 1.6.2, 2.0.0, 2.0.1
>            Reporter: Mike Dusenberry
>            Assignee: Liang-Chi Hsieh
>             Fix For: 2.1.0
>
>
> Calling {{repartition}} on a PySpark RDD to increase the number of partitions results in highly skewed partition sizes, with most partitions holding 0 rows. The {{repartition}} method should spread the rows evenly across the partitions, and the Scala side exhibits this correct behavior.
> Please reference the following code for a reproducible example of this issue:
> {code}
> # Python
> num_partitions = 20000
> a = sc.parallelize(range(int(1e6)), 2)  # start with 2 even partitions
> l = a.repartition(num_partitions).glom().map(len).collect()  # get the length of each partition
> min(l), max(l), sum(l)/len(l), len(l)  # skewed!
>
> // Scala
> val numPartitions = 20000
> val a = sc.parallelize(0 until 1e6.toInt, 2)  // start with 2 even partitions
> val l = a.repartition(numPartitions).glom().map(_.length).collect()  // get the length of each partition
> print(l.min, l.max, l.sum/l.length, l.length)  // even!
> {code}
> The issue matters because highly skewed partitions can cause severe memory pressure in subsequent steps of a processing pipeline, resulting in OOM errors.
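>
> Until upgrading to a release containing the fix, a possible workaround (a minimal sketch, not the change that shipped in 2.1.0) is to key each element explicitly and call {{partitionBy}}, so rows are distributed individually; the skew appears to stem from the round-robin distribution operating on serialized batches of elements rather than on individual rows. The sketch assumes a running SparkContext {{sc}}:
> {code}
> # Python: illustrative workaround sketch, assuming an existing SparkContext `sc`
> num_partitions = 20000
> a = sc.parallelize(range(int(1e6)), 2)  # start with 2 even partitions
>
> # zipWithIndex assigns each element a unique, contiguous index;
> # partitionBy with (index % num_partitions) round-robins elements evenly.
> evenly = (a.zipWithIndex()
>            .map(lambda kv: (kv[1], kv[0]))  # swap to (index, value) pairs
>            .partitionBy(num_partitions, lambda k: k % num_partitions)
>            .values())  # drop the index key, keeping the original values
>
> l = evenly.glom().map(len).collect()
> min(l), max(l), sum(l)/len(l), len(l)  # even
> {code}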