Re: Which String deduplication option?

Brian Loss Mon, 08 Feb 2021 12:57:03 -0800

It might make sense to do both approaches.

It seems there are limits to when -XX:+UseStringDeduplication takes effect. By 
default, it only interns objects that have survived 3 GC cycles, although that 
number can be changed. If the objects in question are short-lived, then it 
wouldn’t make sense to call String.intern on them either. However, if we know 
based on the usage pattern that we’d get a lot of deduplication on long-lived 
strings, then String.intern is better because it will happen right away and 
will also save more memory since the String object itself is de-duped (vs just 
the internal char array for the automatic de-duplication). It wasn’t completely 
clear from my reading, but if I understood correctly the other potential 
downside to UseStringDeduplication is that it happens after GC if there’s time. 
On a heavily loaded system that doesn’t time left in the pause time goal window 
after completing GC, the string de-duplication might not happen at all.


Adding -XX:+UseStringDeduplication wouldn’t hurt and could potentially provide 
some benefit, so I’d be in favor of adding it. For TabletLocator specifically, 
if we know that’s an area where string de-duplication will help, then we should 
probably use String.intern there. As Keith suggested, a stress test might help 
determine whether it makes sense. In the absence of that, if we assume the 
previous WeakHashMap was there to solve a specific problem (vs an uninformed 
attempt to save memory) then String.intern sounds to me like the way to go as 
well.

> On Feb 8, 2021, at 3:34 PM, Dave Marion <[email protected]> wrote:
> 
> String.intern() would seem to provide better coverage considering that some
> users may not use the G1 collector.
> 
> On Mon, Feb 8, 2021 at 3:19 PM Keith Turner <[email protected]> wrote:
> 
>> Recently while running some large map reduce jobs I learned that
>> Hadoop uses String.intern() in its RPC code (below is a link to an
>> example on one place where Hadoop does this).  I learned this because
>> when I ran jstack on NN, RM, and/or AM that were under distress
>> sometimes I kept seeing RPC server threads that were in
>> String.intern().  I never was quite sure if it was a problem though.
>> Not saying String.intern() is bad or good, just sharing something I
>> observed that I was uncertain about.
>> 
>> May make sense to create some sort of stress test that could simulate
>> the usage pattern of the TabletLocator and try the different options
>> and see what happens.  If any long pauses or problems happen in the
>> simulation, they may happen in the real environment.
>> 
>> 
>> https://github.com/apache/hadoop/blob/ba631c436b806728f8ec2f54ab1e289526c90579/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/TaskStatus.java#L481
>> 
>> https://github.com/apache/hadoop/blob/ba631c436b806728f8ec2f54ab1e289526c90579/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/StringInterner.java#L67
>> 
>> On Mon, Feb 1, 2021 at 9:55 PM Christopher <[email protected]> wrote:
>>> 
>>> While code reviewing, I saw that
>>> core/src/main/java/org/apache/accumulo/core/clientImpl/TabletLocator.java
>>> was using a WeakHashMap to deduplicate some strings.
>>> 
>>> This code can probably be removed in favor of one of the following two
>> options:
>>> 
>>> 1. Just explicitly use String.intern() - As of Java 7, there is no
>>> longer a separate, fixed-size PermGen space, so intern'd strings will
>>> be in the main heap, no longer constrained to a limited size pool.
>>> These strings are still subject to garbage collection. It is
>>> implemented as a HashMap internally (native implementation), with a
>>> default bucket size of more than 60K, plenty big enough for the
>>> interning that TabletLocator is doing... but this is configurable by
>>> the user with JVM flags if it's not. Interning will use less memory as
>>> WeakHashMap and similar performance, as long as the bucket size is big
>>> enough.
>>> 
>>> 2. Just use -XX:+UseStringDeduplication JVM flag - as of Java 9, G1 is
>>> the new default Java garbage collector. This garbage collector has the
>>> option to automatically attempt to deduplicate all strings behind the
>>> scenes, by swapping out their underlying char arrays (so, it likely
>>> won't affect == equality because the String object references
>>> themselves won't change, unlike option 1). This is more passive than
>>> option 1, but would apply to the entire JVM. G1GC also implements some
>>> heuristics to prevent too much overhead.
>>> 
>>> With both options, it's possible to output statistics.
>>> 
>>> If I remove the WeakHashMap for the string deduplication in
>>> TabletLocator, does anybody have an opinion on which option I should
>>> replace it with? I'm leaning towards option 2 (adding it to
>>> assemble/conf/accumulo-env.sh as one of the default flags).
>>

Re: Which String deduplication option?

Reply via email to