LantaoJin commented on a change in pull request #29869:
URL: https://github.com/apache/spark/pull/29869#discussion_r497203916
##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -274,8 +274,36 @@ private[spark] class DAGScheduler(
accumUpdates: Seq[AccumulatorV2[_, _]],
metricPeaks: Array[Long],
taskInfo: TaskInfo): Unit = {
+    // It may cause a heavy full GC problem if some heavy external accumulators are kept in memory.
+    // We update these heavy accumulators before they enter the Spark listener event loop.
+    val (heavyAccumUpdates, otherAccumUpdates) = accumUpdates.partition { acc =>
Review comment:
Yes. "Heavy" means the size is too big, such as the accumulator value
contains thousands of file paths. I know this looks not a good solution. Do you
mean to use a size threshold to filter out the big accumulator values? The
problem that way is we split the same accumulators (same name with different
sizes). Actually, Spark uses METRICS_PREFIX to distinguish the internal metrics
from all accumulators `val METRICS_PREFIX = "internal.metrics."`. We can
separate them to `val (externalAccumUpdates, internalAccumUpdates)` and
update/merge the `externalAccumUpdates` before they entering into event loop.
But, you know, we still cannot forbid developers to use the "internal.metrics."
prefix. We only could add some comments in code and document it.
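
A minimal Scala sketch of that separation, assuming `accumUpdates` is a `Seq[AccumulatorV2[_, _]]` as in the surrounding `DAGScheduler` method; the `splitAccumUpdates` helper name is made up for illustration and is not part of this PR:

```scala
import org.apache.spark.util.AccumulatorV2

object AccumSplitSketch {
  // Same prefix Spark uses to name its internal task-metric accumulators.
  val METRICS_PREFIX = "internal.metrics."

  // Hypothetical helper: partition accumulator updates into external
  // (potentially heavy) and internal (metric) accumulators by name prefix.
  def splitAccumUpdates(
      accumUpdates: Seq[AccumulatorV2[_, _]])
    : (Seq[AccumulatorV2[_, _]], Seq[AccumulatorV2[_, _]]) = {
    val (internalAccumUpdates, externalAccumUpdates) =
      accumUpdates.partition(_.name.exists(_.startsWith(METRICS_PREFIX)))
    // The external accumulators could then be updated/merged right here,
    // before the task-end event is posted to the listener bus, so their
    // large values never sit in the event queue.
    (externalAccumUpdates, internalAccumUpdates)
  }
}
```

The caveat mentioned above still applies: nothing stops a user-defined accumulator from being registered with an "internal.metrics." name, so this split is a convention rather than a guarantee.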