Github user tgravescs commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11060#discussion_r52244984
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala ---
    @@ -273,7 +273,8 @@ private class PartitionCoalescer(maxPartitions: Int, prev: RDD[_], balanceSlack:
           groupArr += pgroup
           groupHash.getOrElseUpdate(nxt_replica, ArrayBuffer()) += pgroup
           var tries = 0
    -      while (!addPartToPGroup(nxt_part, pgroup) && tries < targetLen) { // ensure at least one part
    +      // ensure at least one part
    +      while (pgroup.arr.isEmpty && !addPartToPGroup(nxt_part, pgroup) && tries < targetLen) {
    --- End diff --
    
    I was looking at this some more and this isn't the right solution. The issue we are seeing is that the coalescer runs out of partitions to hand out but still loops ~n^2 times (300 * 2000 in our case), and on each of those iterations it fetches partition information from HDFS. The inner 2000 iterations take about 3 seconds, so doing that 300 times adds up to 300 * 3s = 900s, roughly 15 minutes.
    
    We shouldn't run the inner while loop at all once every partition has been given out, but I want to understand a few things better first to make sure that is the whole problem.
    
     Working with @zhuoliu 

