VitoMakarevich opened a new pull request, #42893: URL: https://github.com/apache/spark/pull/42893
### What changes were proposed in this pull request?

The description is listed in [Jira SPARK-44459](https://issues.apache.org/jira/browse/SPARK-44459). I faced an issue when running ~90 streaming queries in parallel in one driver. The driver's memory footprint (max heap and used heap) kept increasing until it reached the node limit (~110 GB in my case) and the driver was killed. In the heap dump I observed hundreds of millions of `java.lang.ref.Finalizer` and `java.util.zip.*` entries.

Digging into the code, I found that on Java 8 many classes in `java.util.zip.*` have a non-empty `finalize` method. This leads to an accumulation of references to objects waiting to be GCed, because only the `finalizer` thread (**one per JVM**) can run a non-empty `finalize` method, and only after that can the GC release the memory. The situation is even worse because many `java.util.zip.*` classes also hold references to native (C) memory, so not only heap memory accumulates.

I verified the finalization queue was the problem by checking the JMX bean that reports its size. I confirmed the source was inside Spark by isolating the problematic part: when I turned off the Spark UI, the problem disappeared. As far as I can tell, the issue comes from `netty` using `java.util.zip` heavily; with a large number of tasks the production rate is so high that in multithreaded code the `finalizer` thread cannot keep up with the growing `Reference` queue. There is a [detailed article with figures](https://medium.com/@vitali.makarevich/spark-structured-streaming-and-java-util-zip-and-finalize-method-83181c6bc86f); let me know if you want me to put it all here.

I made a fix by running logic similar to what already exists in Spark, calling `System.runFinalization()`. Memory consumption normalized and the application can now run for weeks. It would be good if others got the same fix instead of spending weeks finding the cause.
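The diagnosis and mitigation described above can be sketched as a small daemon task, similar in spirit to Spark's existing periodic-GC logic. This is a minimal illustration, not the actual Spark change: the class name and the 30-minute interval are hypothetical, while `MemoryMXBean.getObjectPendingFinalizationCount()` is the JMX signal mentioned above for watching the finalization queue.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PeriodicFinalizationRunner {

    /** Approximate number of objects whose finalize() has not run yet. */
    static int pendingFinalizationCount() {
        MemoryMXBean memoryBean = ManagementFactory.getMemoryMXBean();
        return memoryBean.getObjectPendingFinalizationCount();
    }

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "periodic-finalization");
                t.setDaemon(true); // do not keep the JVM alive
                return t;
            });

        // Hypothetical 30-minute interval; the real change would make this
        // configurable so users can avoid any potential bottleneck.
        scheduler.scheduleWithFixedDelay(() -> {
            System.out.println(
                "Objects pending finalization: " + pendingFinalizationCount());
            // Drain the queue instead of relying solely on the single
            // per-JVM Finalizer thread to run non-empty finalize() methods.
            System.runFinalization();
        }, 0, 30, TimeUnit.MINUTES);

        Thread.sleep(1000); // let the first run fire, then exit for demo purposes
        scheduler.shutdown();
    }
}
```

A production version would read the interval from configuration and log the pending count through the normal logging framework rather than stdout.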
If you want, I can add a configuration value controlling whether this logic runs, so existing users can avoid any potential bottleneck.

### Why are the changes needed?

Any Java 8 environment under similar conditions (e.g. many streaming queries in parallel with many tasks) should have the same problem.

### Does this PR introduce _any_ user-facing change?

No user-facing behavior change, although in theory, under some circumstances, clients may see a reduction in performance.

### How was this patch tested?

I have an application running this in production. I don't expect any existing test to fail, since this logic only improves the cleanup process and is not covered by any existing test. I don't expect performance/benchmarks to degrade (although there may be problems with some combinations of JVM/GC). There may be a way to check the finalization queue size from code (similar to the JMX bean I used to catch the issue) to verify the effect, but the call is essentially a native operation with little room to behave incorrectly.

### Was this patch authored or co-authored using generative AI tooling?

No

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
