JoshRosen commented on code in PR #46705:
URL: https://github.com/apache/spark/pull/46705#discussion_r1613904865
##########
core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala:
##########
@@ -328,16 +328,15 @@ private[spark] object TaskMetrics extends Logging {
*/
def fromAccumulators(accums: Seq[AccumulatorV2[_, _]]): TaskMetrics = {
val tm = new TaskMetrics
- for (acc <- accums) {
+ val (innerAccums, externalAccums) = accums.
+ partition(t => t.name.isDefined && tm.nameToAccums.contains(t.name.get))
Review Comment:
Just thinking aloud here:
Do we do a lot of intermediate collections allocations in this `.partition`
operation itself? e.g. under the hood, would we be resizing an array builder or
doing a bunch of linked list construction operations?
I am wondering whether instead it might be faster (or at least more
predicable / easier to reason about) to construct a Java ArrayBuilder /
ArrayBuffer, imperatively append the external accumulators to it, then do a
single addAll call at the end, e.g. something like
```scala
def fromAccumulators(accums: Seq[AccumulatorV2[_, _]]): TaskMetrics = {
val tm = new TaskMetrics
val externalAccums = new java.util.ArrayList[AccumulatorV2[Any, Any]]()
for (acc <- accums) {
val name = acc.name
if (name.isDefined && tm.nameToAccums.contains(name.get)) {
val tmAcc =
tm.nameToAccums(name.get).asInstanceOf[AccumulatorV2[Any, Any]]
tmAcc.metadata = acc.metadata
tmAcc.merge(acc.asInstanceOf[AccumulatorV2[Any, Any]])
} else {
externalAccums.add(acc)
}
}
tm._externalAccums.addAll(externalAccums)
tm
}
```
This is less net code churn and is probably either comparable in performance
or faster, and avoids any performance unpredictability as a function of the Seq
type.
Given that this code runs in a hot path, I think the use of imperative
mutation vs. a more pure functional approach is fine.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]