LantaoJin commented on a change in pull request #29869:
URL: https://github.com/apache/spark/pull/29869#discussion_r497203916



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -274,8 +274,36 @@ private[spark] class DAGScheduler(
       accumUpdates: Seq[AccumulatorV2[_, _]],
       metricPeaks: Array[Long],
       taskInfo: TaskInfo): Unit = {
+    // It may cause a heavy full GC problem if some heavy external accumulators are kept in memory.
+    // We update these heavy accumulators before they enter the Spark listener event loop.
+    val (heavyAccumUpdates, otherAccumUpdates) = accumUpdates.partition { acc =>

Review comment:
       Yes. "Heavy" means the size is too big, such as the accumulator value 
contains thousands of file paths. I know this looks not a good solution. Do you 
mean to use a size threshold to filter out the big accumulator values? The 
problem that way is we split the same accumulators (same name with different 
sizes). Actually, Spark uses METRICS_PREFIX to distinguish the internal metrics 
from all accumulators `val METRICS_PREFIX = "internal.metrics."`. We can split 
to `val (externalAccumUpdates, internalAccumUpdates)` and update/merge the 
`externalAccumUpdates` before they entering into event loop. But, you know, we 
still cannot forbid developers to use the "internal.metrics." prefix. We only 
could add some comments in code and document it.
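
       For illustration only, a minimal sketch of that external/internal split, assuming it runs inside `DAGScheduler` where `accumUpdates: Seq[AccumulatorV2[_, _]]` is in scope. The helper name `splitAccumUpdates` is hypothetical; `InternalAccumulator.METRICS_PREFIX` is Spark's existing `"internal.metrics."` constant:

       ```scala
       import org.apache.spark.InternalAccumulator
       import org.apache.spark.util.AccumulatorV2

       // Hypothetical helper: partition a task's accumulator updates into external
       // (user-defined, possibly very large) and internal metric accumulators, based
       // on whether the accumulator name starts with "internal.metrics.".
       def splitAccumUpdates(
           accumUpdates: Seq[AccumulatorV2[_, _]])
           : (Seq[AccumulatorV2[_, _]], Seq[AccumulatorV2[_, _]]) = {
         accumUpdates.partition { acc =>
           !acc.name.exists(_.startsWith(InternalAccumulator.METRICS_PREFIX))
         }
       }
       ```

       The `externalAccumUpdates` half could then be updated/merged eagerly, so that only the small internal metric accumulators travel through the listener event loop.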



