VitoMakarevich opened a new pull request, #42893: URL: https://github.com/apache/spark/pull/42893
### What changes were proposed in this pull request?

The description is listed in [Jira SPARK-44459](https://issues.apache.org/jira/browse/SPARK-44459). I faced an issue when running ~90 streaming queries in parallel in one driver. The driver's memory footprint (max heap and used heap) kept increasing until it reached the node limit (~110 GB in my case) and the driver was killed. In the heap dump I observed hundreds of millions of `java.lang.ref.Finalizer` and `java.util.zip.*` entries.

Digging into the code, I found that on Java 8 many classes in `java.util.zip.*` have a non-empty `finalize` method. This leads to an accumulation of references to objects waiting to be GCed, because only the `finalizer` thread (**one per JVM**) can run a non-empty `finalize` method, and only after that can the GC release the memory. The situation is even worse because many `java.util.zip.*` classes also hold references to native (C) memory, so not only heap memory accumulates.

I verified the finalization queue was the problem by checking the JMX bean that reports its size. I confirmed the source was inside Spark by isolating the problematic part: when I turned off the Spark UI, the problem disappeared. As far as I can tell, the issue comes from `netty` using `java.util.zip` heavily; with a large number of tasks the production rate is so high that in multithreaded code the `finalizer` thread cannot keep up with the growing `Reference` queue. There is a [detailed article with figures](https://medium.com/@vitali.makarevich/spark-structured-streaming-and-java-util-zip-and-finalize-method-83181c6bc86f); let me know if you want me to put it all here.

I made a fix by running logic similar to what already exists in Spark, calling `System.runFinalization()`. Memory consumption normalized and the application can now run for weeks. It would be good if others got the same fix instead of spending weeks finding the cause.
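The diagnosis and mitigation described above can be sketched as a small daemon task, similar in spirit to Spark's existing periodic-GC logic. This is a minimal illustration, not the actual Spark change: the class name and the 30-minute interval are hypothetical, while `MemoryMXBean.getObjectPendingFinalizationCount()` is the JMX signal mentioned above for watching the finalization queue.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PeriodicFinalizationRunner {

    /** Approximate number of objects whose finalize() has not run yet. */
    static int pendingFinalizationCount() {
        MemoryMXBean memoryBean = ManagementFactory.getMemoryMXBean();
        return memoryBean.getObjectPendingFinalizationCount();
    }

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "periodic-finalization");
                t.setDaemon(true); // do not keep the JVM alive
                return t;
            });

        // Hypothetical 30-minute interval; the real change would make this
        // configurable so users can avoid any potential bottleneck.
        scheduler.scheduleWithFixedDelay(() -> {
            System.out.println(
                "Objects pending finalization: " + pendingFinalizationCount());
            // Drain the queue instead of relying solely on the single
            // per-JVM Finalizer thread to run non-empty finalize() methods.
            System.runFinalization();
        }, 0, 30, TimeUnit.MINUTES);

        Thread.sleep(1000); // let the first run fire, then exit for demo purposes
        scheduler.shutdown();
    }
}
```

A production version would read the interval from configuration and log the pending count through the normal logging framework rather than stdout.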
If you want, I can add a configuration value controlling whether this logic runs, so existing users can avoid any potential bottleneck.

### Why are the changes needed?

Any Java 8 environment under similar conditions (e.g. many streaming queries in parallel with many tasks) should have the same problem.

### Does this PR introduce _any_ user-facing change?

No user-facing behavior change, although in theory, under some circumstances, clients may see a reduction in performance.

### How was this patch tested?

I have an application running this in production. I don't expect any existing test to fail, since this logic only improves the cleanup process and is not covered by any existing test. I don't expect performance/benchmarks to degrade (although there may be problems with some combinations of JVM/GC). There may be a way to check the finalization queue size from code (similar to the JMX bean I used to catch the issue) to verify the effect, but the call is essentially a native operation with little room to behave incorrectly.

### Was this patch authored or co-authored using generative AI tooling?

No

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
