Github user countmdm commented on the issue:
https://github.com/apache/spark/pull/21456
I confirm that the -XX:+UseStringDeduplication option is available only
with the G1 GC, and it is off by default. So if we decide to rely on it, we
won't be able to enforce it reliably, especially for applications that just
use some library code from Spark (incidentally, this issue was found in the
YARN NodeManager). Another problem with string deduplication in G1 is that
it is not aggressive at all: it is performed by a single background thread
that scans the heap only when the application threads are lightly loaded.
In the case we investigated, the number of duplicate strings was high, yet
the strings were relatively short-lived. That is, they survived long enough
to put significant pressure on the GC, but I doubt that window would be
long enough for the deduplication thread to eliminate them.
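For reference, enabling it looks roughly like this (flag names are from
JEP 192, in their JDK 8 spelling; MyApp is a placeholder, and the
statistics flag merely reports deduplication activity after each GC):

    java -XX:+UseG1GC \
         -XX:+UseStringDeduplication \
         -XX:+PrintStringDeduplicationStatistics \
         MyApp

Note also -XX:StringDeduplicationAgeThreshold (default 3): a String only
becomes a deduplication candidate after surviving that many GCs, which is
one more reason short-lived duplicates like the ones described above tend
to escape the deduplication thread entirely.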
In summary, this kind of targeted, explicit string deduplication is not
uncommon at all, and works really well. Usually you just need to add an
.intern() call in a few places in the code (a minimal sketch follows).
What I had to do here is more involved because of the extra problem with
java.io.File.
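To make the basic pattern concrete, here is a minimal sketch; the class
and field names are hypothetical, not the actual Spark change, which is
more involved:

    import java.io.File;

    // Hypothetical holder for a file segment; many instances may carry
    // logically identical path strings.
    class FileSegment {
        private final File file;

        FileSegment(String path) {
            // Intern before constructing the File. File keeps the
            // (normalized) pathname in a private final String field, so
            // there is no supported way to deduplicate it afterwards; on
            // an already-normal path the constructor typically reuses
            // the argument, so interning it up front lets all equal
            // paths share one canonical copy.
            this.file = new File(path.intern());
        }

        File file() { return file; }
    }

The catch hinted at above is exactly this: when the File (or the path
string inside it) is produced by code you don't control, there is no
clean place to put the .intern() call, which is what makes the change in
this PR more involved than the usual one-liner.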