Re: [PR] [SPARK-48391][CORE]Using addAll instead of add function in fromAccumulatorInfos method of TaskMetrics Class [spark]

via GitHub Fri, 24 May 2024 12:10:10 -0700


JoshRosen commented on code in PR #46705:
URL: https://github.com/apache/spark/pull/46705#discussion_r1613904865



##########
core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala:
##########
@@ -328,16 +328,15 @@ private[spark] object TaskMetrics extends Logging {
    */
   def fromAccumulators(accums: Seq[AccumulatorV2[_, _]]): TaskMetrics = {
     val tm = new TaskMetrics
-    for (acc <- accums) {
+    val (innerAccums, externalAccums) = accums.
+      partition(t => t.name.isDefined && tm.nameToAccums.contains(t.name.get))

Review Comment:
   Just thinking aloud here:
   
   Do we do a lot of intermediate collections allocations in this `.partition` 
operation itself? e.g. under the hood, would we be resizing an array builder or 
doing a bunch of linked list construction operations?
   
   I am wondering whether instead it might be faster (or at least more 
predicable / easier to reason about) to construct a Java ArrayBuilder / 
ArrayBuffer, imperatively append the external accumulators to it, then do a 
single addAll call at the end, e.g. something like
   
   ```scala
   def fromAccumulators(accums: Seq[AccumulatorV2[_, _]]): TaskMetrics = {
       val tm = new TaskMetrics
       val externalAccums = new java.util.ArrayList[AccumulatorV2[Any, Any]]()
       for (acc <- accums) {
         val name = acc.name
         if (name.isDefined && tm.nameToAccums.contains(name.get)) {
           val tmAcc = 
tm.nameToAccums(name.get).asInstanceOf[AccumulatorV2[Any, Any]]
           tmAcc.metadata = acc.metadata
           tmAcc.merge(acc.asInstanceOf[AccumulatorV2[Any, Any]])
         } else {
           externalAccums.add(acc)
         }
       }
       tm._externalAccums.addAll(externalAccums)
       tm
     }
   ```
   
   This is less net code churn and is probably either comparable in performance 
or faster, and avoids any performance unpredictability as a function of the Seq 
type.
   
   Given that this code runs in a hot path, I think the use of imperative 
mutation vs. a more pure functional approach is fine.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-48391][CORE]Using addAll instead of add function in fromAccumulatorInfos method of TaskMetrics Class [spark]

Reply via email to