VitoMakarevich opened a new pull request, #42893:
URL: https://github.com/apache/spark/pull/42893

   ### What changes were proposed in this pull request?
   The description is listed in [Jira 
SPARK-44459](https://issues.apache.org/jira/browse/SPARK-44459).
   I faced an issue when running ~90 streaming queries in parallel in one 
driver. Driver memory footprint(max heap and used heap) was constantly 
increasing until reached the node limit(in my case ~110 GBs) and the driver got 
killed. When I checked the heap dump, I observed hundred of millions of 
`java.lang.ref.Finalizer` and `java.util.zip.*` entries. When I dug into the 
code, I found that on Java 8 - many classes from `java.util.zip.*` include 
non-empty `finalize` method.
   This, in turn, leads to accumulation of references to classes to be GCed, 
and only `finalizer` thread(**one per JVM**) can run the non-empty `finalize` 
method, and only then memory will be released by GC. Also, many classes from 
`java.util.zip.*` hold references to native(C) memory, so the situation is even 
worse since not only heap memory is accumulating. I verified it's an issue with 
finalization queue when checked JMX bean related to finalization queue size.
   I ensured this comes from the Spark package by finding a problematic part of 
Spark - when I turned off SparkUI - the problem disappeared. As per my 
findings, the issue comes from `netty` using `java.util.zip` heavily, and a 
large number of tasks leads to an enormous production rate, so in multithreaded 
code `finalizer` thread could not keep up with the increasing `Reference` queue.
   There are [detailed article with 
figures](https://medium.com/@vitali.makarevich/spark-structured-streaming-and-java-util-zip-and-finalize-method-83181c6bc86f)
 - let me know if you want me to put it all here.
   I made a fix by running logic similar to existing in spark with 
`System.runFinalization()` and memory consumption normalized and the 
application can run for weeks, but it would be good if others had the same fix 
and didn't spend weeks finding the cause.
   If you want, I can add a configuration value controlling whether this logic 
should be run, so any potential bottlenecks could be avoided by existing users.
   
   
   ### Why are the changes needed?
   Java 8 environment with similar conditions(e.g. a lot of streaming queries 
in parallel with a lot of tasks) should have the same problem.
   
   
   ### Does this PR introduce _any_ user-facing change?
   It has no change the user will face, but in theory, with some circumstances 
client may see reduction in performance.
   
   
   ### How was this patch tested?
   I have an application running this in production. I don't expect any 
existing test to fail since this logic only improves the cleanup process and 
it's not covered by any existing test. I don't expect performance/benchmark to 
degrade(although can admit there may be problems with some combination of 
JVM/GC).
   I think there may be a way to check the finalization queue size from the 
code(similar to the JMX bean I used to catch the issue) to check that it has an 
effect, but it looks to be a very native operation that has no possibility to 
work incorrectly.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to