LantaoJin commented on a change in pull request #29869:
URL: https://github.com/apache/spark/pull/29869#discussion_r497203916



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -274,8 +274,36 @@ private[spark] class DAGScheduler(
       accumUpdates: Seq[AccumulatorV2[_, _]],
       metricPeaks: Array[Long],
       taskInfo: TaskInfo): Unit = {
+    // It may cause a heavy full GC problem if some heavy external accumulators are kept in memory.
+    // We update these heavy accumulators before they enter the Spark listener event loop.
+    val (heavyAccumUpdates, otherAccumUpdates) = accumUpdates.partition { acc =>

Review comment:
       Yes. "Heavy" means the size is too big, for example an accumulator value that contains thousands of file paths. I know this doesn't look like a good solution. Do you mean that using a size threshold to filter out the big accumulator values would be a better solution? The problem with that approach is that we would split the same accumulator (same name, different sizes). Actually, Spark uses METRICS_PREFIX to distinguish the internal metrics from all other accumulators: `val METRICS_PREFIX = "internal.metrics."`. We could split them into `val (externalAccumUpdates, internalAccumUpdates)` and update/merge the `externalAccumUpdates` before they enter the event loop. But, you know, we still cannot forbid developers from using the "internal.metrics." prefix; we could only add some comments in the code and document it.
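
       Not part of the patch, just to make the prefix-based alternative concrete: a minimal sketch, assuming it lives somewhere inside Spark's own packages (the real `InternalAccumulator.METRICS_PREFIX` constant is `private[spark]`); the helper name `splitByPrefix` is purely illustrative.

       ```scala
       import org.apache.spark.InternalAccumulator
       import org.apache.spark.util.AccumulatorV2

       // Sketch only: separate internal metric accumulators (named with the
       // "internal.metrics." prefix) from external, potentially large, user accumulators.
       def splitByPrefix(accumUpdates: Seq[AccumulatorV2[_, _]])
           : (Seq[AccumulatorV2[_, _]], Seq[AccumulatorV2[_, _]]) = {
         accumUpdates.partition { acc =>
           acc.name.exists(_.startsWith(InternalAccumulator.METRICS_PREFIX))
         }
       }

       // Usage (illustrative): internal metrics go through the event loop as today,
       // external accumulators could be merged/updated before the event is posted.
       // val (internalAccumUpdates, externalAccumUpdates) = splitByPrefix(accumUpdates)
       ```

       Splitting on the name prefix keeps any given accumulator on one side regardless of its current size, which avoids the problem of a size threshold putting the same accumulator on different sides for different tasks.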




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
