Github user countmdm commented on the issue:
https://github.com/apache/spark/pull/21456
I confirm that the -XX:+UseStringDeduplication option is available only
with the G1 GC, and it is off by default. So if we decide to rely on it, we
won't be able to enforce it reliably, especially for applications that just
use some library code from Spark (incidentally, this issue was found in the
YARN NodeManager). Another problem with string deduplication in G1 is that
it is not aggressive at all: it is performed by a single background thread
that scans the heap only when the application threads are lightly loaded.
In the case we investigated, the number of duplicate strings was high, yet
the strings were relatively short-lived. That is, they survived long enough
to put significant pressure on the GC, but I doubt that window would be
long enough for the deduplication thread to eliminate them.
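For reference, enabling it looks roughly like this (flag names are from
JEP 192, in their JDK 8 spelling; MyApp is a placeholder, and the
statistics flag merely reports deduplication activity after each GC):

    java -XX:+UseG1GC \
         -XX:+UseStringDeduplication \
         -XX:+PrintStringDeduplicationStatistics \
         MyApp

Note also -XX:StringDeduplicationAgeThreshold (default 3): a String only
becomes a deduplication candidate after surviving that many GCs, which is
one more reason short-lived duplicates like the ones described above tend
to escape the deduplication thread entirely.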
In summary, this kind of targeted, explicit string deduplication is not
uncommon at all, and works really well. Usually you just need to add an
.intern() call in a few places in the code (a minimal sketch follows).
What I had to do here is more involved because of the extra problem with
java.io.File.
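To make the basic pattern concrete, here is a minimal sketch; the class
and field names are hypothetical, not the actual Spark change, which is
more involved:

    import java.io.File;

    // Hypothetical holder for a file segment; many instances may carry
    // logically identical path strings.
    class FileSegment {
        private final File file;

        FileSegment(String path) {
            // Intern before constructing the File. File keeps the
            // (normalized) pathname in a private final String field, so
            // there is no supported way to deduplicate it afterwards; on
            // an already-normal path the constructor typically reuses
            // the argument, so interning it up front lets all equal
            // paths share one canonical copy.
            this.file = new File(path.intern());
        }

        File file() { return file; }
    }

The catch hinted at above is exactly this: when the File (or the path
string inside it) is produced by code you don't control, there is no
clean place to put the .intern() call, which is what makes the change in
this PR more involved than the usual one-liner.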