[
https://issues.apache.org/jira/browse/SPARK-44459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-44459:
-----------------------------------
Labels: MemoryLeak gzip memory-bug memory-control pull-request-available
structured-streaming (was: MemoryLeak gzip memory-bug memory-control
structured-streaming)
> Garbage collection doesn't include finalization run
> ---------------------------------------------------
>
> Key: SPARK-44459
> URL: https://issues.apache.org/jira/browse/SPARK-44459
> Project: Spark
> Issue Type: Improvement
> Components: Structured Streaming, Web UI
> Affects Versions: 3.3.0
> Environment: AWS EMR 6.9
> Hudi 0.12.1
> Spark 3.3.0
> Reporter: Vitali Makarevich
> Priority: Major
> Labels: MemoryLeak, gzip, memory-bug, memory-control,
> pull-request-available, structured-streaming
>
>
> {panel:title=Problem description}
> Full text with figures is available [here (4 min
> read)|https://medium.com/@vitaliy.makarevich.work/spark-structured-streaming-and-java-util-zip-and-finalize-method-83181c6bc86f];
> I can post it here in full as well, but will give a shorter version.
> When running a relatively big application (dozens of streams in parallel),
> the Spark driver grows in memory up to 110 GB (at which point I stopped the
> test). When I inspect the heap dump and the JMX finalization queue size, I
> see the JVM struggling with an accumulation of java.lang.ref.Finalizer
> instances and the objects they reference. Most of the objects in the
> finalization queue come from the java.util.zip package.
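>
> A minimal sketch (plain JMX, nothing Spark-specific) of how the finalization
> queue size can be monitored programmatically; MemoryMXBean exposes an
> approximate count of objects pending finalization:
> {code:scala}
> import java.lang.management.ManagementFactory
>
> // Approximate number of objects whose finalize() has been queued
> // by the garbage collector but has not yet been run.
> val memoryBean = ManagementFactory.getMemoryMXBean
> println(s"Objects pending finalization: ${memoryBean.getObjectPendingFinalizationCount}")
> {code}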
> {panel}
>
> {panel:title=Underlying Java implementation}
> In a nutshell, in Java 8 every Object has a `finalize` method. If an object
> overrides it with a non-empty implementation, the object is not reclaimed
> as soon as the garbage collector finds it unreachable; instead it is put on
> the Finalizer queue. The JVM's Finalizer thread then takes each object from
> the queue and runs `finalize`. The problem is that in big applications the
> finalizer queue grows far faster than the (low-priority) Finalizer thread
> can drain it.
> Very frequently, zip package instances refer to C memory (the package is
> implemented natively), so even that native memory is not freed until
> `finalize` is called.
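>
> To illustrate with my own example (not code from Spark):
> java.util.zip.Deflater allocates native zlib memory that is released only
> when its finalize() eventually runs, unless end() is called explicitly:
> {code:scala}
> import java.util.zip.Deflater
>
> val deflater = new Deflater()
> try {
>   deflater.setInput("some small payload".getBytes("UTF-8"))
>   deflater.finish()
>   val buf = new Array[Byte](128)
>   val written = deflater.deflate(buf) // number of compressed bytes in buf
> } finally {
>   // Frees the native zlib state eagerly; without this call the memory is
>   // held until the Finalizer thread eventually runs finalize().
>   deflater.end()
> }
> {code}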
> {panel}
>
> {panel:title=Application}
> The application where I caught this runs 90 streaming queries in parallel
> with a batch interval of about 1 hour. It reads data from
> [Apache Hudi|https://hudi.apache.org/] and writes output to another path in
> Apache Hudi (version 0.12.1). It runs on AWS EMR 6.9 on Java 8.
> The Spark UI/event log is enabled with the following settings:
> {code:java}
> "spark.ui.enabled" = "true"
>
> ## How many jobs the Spark UI and status APIs remember before garbage collecting.
> ## This is a target maximum, and fewer elements may be retained in some circumstances.
> ## Default value: 1000
> "spark.ui.retainedJobs" = "100"
>
> ## How many stages the Spark UI and status APIs remember before garbage collecting.
> ## This is a target maximum, and fewer elements may be retained in some circumstances.
> ## Default value: 1000
> "spark.ui.retainedStages" = "50"
>
> ## How many tasks in one stage the Spark UI and status APIs remember before garbage collecting.
> ## This is a target maximum, and fewer elements may be retained in some circumstances.
> ## Default value: 100000
> "spark.ui.retainedTasks" = "50"
>
> ## How many DAG graph nodes the Spark UI and status APIs remember before garbage collecting.
> ## Default value: Int.MaxValue (2^31) - Here we use 2^15 instead.
> "spark.ui.dagGraph.retainedRootRDDs" = "32768"
>
> "spark.worker.ui.retainedExecutors" = "10"
> "spark.worker.ui.retainedDrivers" = "10"
> "spark.sql.ui.retainedExecutions" = "10"
> "spark.streaming.ui.retainedBatches" = "10"
>
> "spark.eventLog.enabled" = "true"
> "spark.eventLog.rotation.enabled" = "true"
> "spark.eventLog.rotation.interval" = "3600"
> "spark.eventLog.rotation.minFileSize" = "1024m"
> "spark.eventLog.rotation.maxFilesToRetain" = "5" {code}
>
> The Spark UI is a crucial factor, since with it disabled, memory
> consumption is fine. I've played around with its settings; unfortunately,
> even very conservative values don't help.
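>
> For reference, a sketch of how these retention settings can be applied when
> building the session (standard SparkSession API; the application name is
> illustrative):
> {code:scala}
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder()
>   .appName("hudi-streaming-app") // illustrative name
>   .config("spark.ui.retainedJobs", "100")
>   .config("spark.ui.retainedStages", "50")
>   .config("spark.ui.retainedTasks", "50")
>   .config("spark.sql.ui.retainedExecutions", "10")
>   .getOrCreate()
> {code}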
> {panel}
>
> {panel:title=What is the inner source of the issue}
> I'm not sure about the root cause, but it looks like the driver heavily
> uses the zip package for small pieces of data. I assume it comes from some
> networking layer where traffic is compressed (I saw some java.util.zip
> instances coming from Netty, but once they were GCed I could not trace the
> source back, since they were referenced only by the Finalizer queue).
> {panel}
>
> {panel:title=Proposed solution}
> As a workaround, I've added a background service that runs
> `System.runFinalization()` with the same frequency as
> `spark.cleaner.periodicGC.interval`, and it works well: memory consumption
> stays stable at an acceptable level (instead of indefinite growth past
> 100 GB, it stays at 60-70 GB total heap, ~40 GB used, which I consider OK
> for such an intensive application).
> So the proposed solution is to add a `System.runFinalization()` call
> [here|https://github.com/apache/spark/blob/85d8d62216d3b830cc5af3dec05422a9cda4cea0/core/src/main/scala/org/apache/spark/ContextCleaner.scala#L131].
> I don't think there is any drawback to it (such as reduced performance).
> Alternatively, it could be added as a separate service like the current
> `System.gc` one, or put behind a feature flag for compatibility.
> I'll be able to create a patchset for it once someone confirms the approach
> is acceptable.
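>
> A minimal sketch of that workaround service, assuming a fixed period
> configured to match `spark.cleaner.periodicGC.interval` (the object and
> method names here are illustrative, not Spark API):
> {code:scala}
> import java.util.concurrent.{Executors, TimeUnit}
>
> object PeriodicFinalizer {
>   private val scheduler = Executors.newSingleThreadScheduledExecutor()
>
>   def start(periodMinutes: Long): Unit =
>     scheduler.scheduleWithFixedDelay(
>       new Runnable {
>         // Drains the finalizer queue so java.util.zip objects release
>         // their native memory without waiting for the Finalizer thread.
>         override def run(): Unit = System.runFinalization()
>       },
>       periodMinutes, periodMinutes, TimeUnit.MINUTES)
>
>   def stop(): Unit = scheduler.shutdown()
> }
>
> // e.g. PeriodicFinalizer.start(30) to match a 30-minute GC interval
> {code}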
> {panel}
>
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]