ulysses-you commented on pull request #34820:
URL: https://github.com/apache/spark/pull/34820#issuecomment-987528689
@maryannxue `CoalescedPartitionSpec` only cares about the reduce partitions,
so the maximum length that needs to be sorted is the shuffle partition number.
I therefore think the added complexity does no harm.
To be conservative, I ran a benchmark of the added complexity with 100000
shuffle partitions:
```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

val shufflePartitions = 100000
val rand = new Random(0)
// Random byte sizes per shuffle partition.
val bytesByPartitionId =
  Seq.tabulate(shufflePartitions)(_ => rand.nextLong().abs).toArray

// Split the partition index space into random coalesced groups.
var previous = 0
var next = 0
val partitions = new ArrayBuffer[(Int, Int)]()
while (next < shufflePartitions) {
  next = next + rand.nextInt(100)
  partitions.append((previous, next.min(shufflePartitions - 1)))
  previous = next
}

// Time the per-group sort by descending partition size.
val startTime = System.nanoTime()
partitions.foreach { case (start, end) =>
  (start until end)
    .map(index => (index, bytesByPartitionId(index)))
    .sortBy(_._2)(implicitly[Ordering[Long]].reverse)
}
println((System.nanoTime() - startTime) / 1000000)
```
It takes 73ms.
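As a side note on why the cost stays bounded: the coalesced groups cover disjoint index ranges, so the element counts of all the per-group sorts sum to at most the shuffle partition number. A minimal sketch of that bound (the group-splitting logic mirrors the benchmark above; the `+ 1` step and names are illustrative, not from the patch):

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

val shufflePartitions = 100000
val rand = new Random(0)

// Split [0, shufflePartitions) into random, disjoint groups,
// the same way the benchmark does.
val partitions = new ArrayBuffer[(Int, Int)]()
var previous = 0
var next = 0
while (next < shufflePartitions) {
  next = next + rand.nextInt(100) + 1 // + 1 guarantees progress
  partitions.append((previous, next.min(shufflePartitions)))
  previous = next
}

// The ranges telescope: each group starts where the previous one
// ended, so the sorted lengths sum to the shuffle partition number.
val totalSorted = partitions.map { case (s, e) => e - s }.sum
println(totalSorted) // 100000
```

So even in the worst case, the per-group sorts together touch each shuffle partition index once.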
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]