Re: Which String deduplication option?

Keith Turner Mon, 08 Feb 2021 12:19:36 -0800

Recently while running some large map reduce jobs I learned that
Hadoop uses String.intern() in its RPC code (below is a link to an
example on one place where Hadoop does this).  I learned this because
when I ran jstack on NN, RM, and/or AM that were under distress
sometimes I kept seeing RPC server threads that were in
String.intern().  I never was quite sure if it was a problem though.
Not saying String.intern() is bad or good, just sharing something I
observed that I was uncertain about.


May make sense to create some sort of stress test that could simulate
the usage pattern of the TabletLocator and try the different options
and see what happens.  If any long pauses or problems happen in the
simulation, they may happen in the real environment.

https://github.com/apache/hadoop/blob/ba631c436b806728f8ec2f54ab1e289526c90579/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/TaskStatus.java#L481
https://github.com/apache/hadoop/blob/ba631c436b806728f8ec2f54ab1e289526c90579/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/StringInterner.java#L67

On Mon, Feb 1, 2021 at 9:55 PM Christopher <ctubb...@apache.org> wrote:
>
> While code reviewing, I saw that
> core/src/main/java/org/apache/accumulo/core/clientImpl/TabletLocator.java
> was using a WeakHashMap to deduplicate some strings.
>
> This code can probably be removed in favor of one of the following two 
> options:
>
> 1. Just explicitly use String.intern() - As of Java 7, there is no
> longer a separate, fixed-size PermGen space, so intern'd strings will
> be in the main heap, no longer constrained to a limited size pool.
> These strings are still subject to garbage collection. It is
> implemented as a HashMap internally (native implementation), with a
> default bucket size of more than 60K, plenty big enough for the
> interning that TabletLocator is doing... but this is configurable by
> the user with JVM flags if it's not. Interning will use less memory as
> WeakHashMap and similar performance, as long as the bucket size is big
> enough.
>
> 2. Just use -XX:+UseStringDeduplication JVM flag - as of Java 9, G1 is
> the new default Java garbage collector. This garbage collector has the
> option to automatically attempt to deduplicate all strings behind the
> scenes, by swapping out their underlying char arrays (so, it likely
> won't affect == equality because the String object references
> themselves won't change, unlike option 1). This is more passive than
> option 1, but would apply to the entire JVM. G1GC also implements some
> heuristics to prevent too much overhead.
>
> With both options, it's possible to output statistics.
>
> If I remove the WeakHashMap for the string deduplication in
> TabletLocator, does anybody have an opinion on which option I should
> replace it with? I'm leaning towards option 2 (adding it to
> assemble/conf/accumulo-env.sh as one of the default flags).

Re: Which String deduplication option?

Reply via email to