Yeah, I saw that, and replied to that point in my PR at
https://github.com/apache/accumulo/pull/1920#issuecomment-776795695

On Wed, Feb 10, 2021 at 12:16 PM David <dam6...@gmail.com> wrote:
>
> Hey,
>
> For what it's worth, the Guava team added that bit about "String.intern()
> has some well-known performance limitations, and should generally be
> avoided" was added on Oct 13, 2020.
>
> Thanks.
>
> On Mon, Feb 8, 2021 at 5:06 PM Christopher <ctubb...@apache.org> wrote:
>
> > Guava's argument in the linked comment appears to be based on
> > pre-Java8, before the PermGen space was consolidated with the main
> > heap and had a fixed size.
> >
> > In response to the other observations: a stress test here seems
> > particularly difficult to compare between String.intern and G1GC
> > deduplication, since the latter will deduplicate across the JVM, and
> > not just the TabletLocator stuff. So, we wouldn't be able to get a
> > good direct comparison between the overall impact between the two.
> >
> > As for running a stress test between WeakHashMap and String.intern,
> > just for TabletLocator, I'm not going to bother because others have
> > already done that work and determined them to be similar in
> > performance, given an adequate string table size (one is at
> > http://java-performance.info/string-intern-in-java-6-7-8/)
> >
> > So, what I will do is create a PR to replace the non-tunable
> > WeakHashMap with the tunable String.intern for TabletLocator, on the
> > basis that they have comparable performance, but the latter is
> > user-tunable if they need to, uses less memory, and involves less
> > code. I will not do anything with the G1GC settings, leaving that up
> > to users to experiment with and tune on their own, if they wish.
> >
> > On Mon, Feb 8, 2021 at 4:31 PM David <dam6...@gmail.com> wrote:
> > >
> > > Guava argues for the use of a weak hashmap.
> > >
> > >
> > https://github.com/google/guava/blob/master/guava/src/com/google/common/collect/Interner.java#L28-L30
> > >
> > > On Mon, Feb 8, 2021 at 3:57 PM Brian Loss <brianl...@gmail.com> wrote:
> > >
> > > > It might make sense to do both approaches.
> > > >
> > > > It seems there are limits to when -XX:+UseStringDeduplication takes
> > > > effect. By default, it only interns objects that have survived 3 GC
> > cycles,
> > > > although that number can be changed. If the objects in question are
> > > > short-lived, then it wouldn’t make sense to call String.intern on them
> > > > either. However, if we know based on the usage pattern that we’d get a
> > lot
> > > > of deduplication on long-lived strings, then String.intern is better
> > > > because it will happen right away and will also save more memory since
> > the
> > > > String object itself is de-duped (vs just the internal char array for
> > the
> > > > automatic de-duplication). It wasn’t completely clear from my reading,
> > but
> > > > if I understood correctly the other potential downside to
> > > > UseStringDeduplication is that it happens after GC if there’s time. On
> > a
> > > > heavily loaded system that doesn’t time left in the pause time goal
> > window
> > > > after completing GC, the string de-duplication might not happen at all.
> > > >
> > > > Adding -XX:+UseStringDeduplication wouldn’t hurt and could potentially
> > > > provide some benefit, so I’d be in favor of adding it. For
> > TabletLocator
> > > > specifically, if we know that’s an area where string de-duplication
> > will
> > > > help, then we should probably use String.intern there. As Keith
> > suggested,
> > > > a stress test might help determine whether it makes sense. In the
> > absence
> > > > of that, if we assume the previous WeakHashMap was there to solve a
> > > > specific problem (vs an uninformed attempt to save memory) then
> > > > String.intern sounds to me like the way to go as well.
> > > >
> > > > > On Feb 8, 2021, at 3:34 PM, Dave Marion <dmario...@gmail.com> wrote:
> > > > >
> > > > > String.intern() would seem to provide better coverage considering
> > that
> > > > some
> > > > > users may not use the G1 collector.
> > > > >
> > > > > On Mon, Feb 8, 2021 at 3:19 PM Keith Turner <ke...@deenlo.com>
> > wrote:
> > > > >
> > > > >> Recently while running some large map reduce jobs I learned that
> > > > >> Hadoop uses String.intern() in its RPC code (below is a link to an
> > > > >> example on one place where Hadoop does this).  I learned this
> > because
> > > > >> when I ran jstack on NN, RM, and/or AM that were under distress
> > > > >> sometimes I kept seeing RPC server threads that were in
> > > > >> String.intern().  I never was quite sure if it was a problem though.
> > > > >> Not saying String.intern() is bad or good, just sharing something I
> > > > >> observed that I was uncertain about.
> > > > >>
> > > > >> May make sense to create some sort of stress test that could
> > simulate
> > > > >> the usage pattern of the TabletLocator and try the different options
> > > > >> and see what happens.  If any long pauses or problems happen in the
> > > > >> simulation, they may happen in the real environment.
> > > > >>
> > > > >>
> > > > >>
> > > >
> > https://github.com/apache/hadoop/blob/ba631c436b806728f8ec2f54ab1e289526c90579/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/TaskStatus.java#L481
> > > > >>
> > > > >>
> > > >
> > https://github.com/apache/hadoop/blob/ba631c436b806728f8ec2f54ab1e289526c90579/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/StringInterner.java#L67
> > > > >>
> > > > >> On Mon, Feb 1, 2021 at 9:55 PM Christopher <ctubb...@apache.org>
> > wrote:
> > > > >>>
> > > > >>> While code reviewing, I saw that
> > > > >>>
> > > >
> > core/src/main/java/org/apache/accumulo/core/clientImpl/TabletLocator.java
> > > > >>> was using a WeakHashMap to deduplicate some strings.
> > > > >>>
> > > > >>> This code can probably be removed in favor of one of the following
> > two
> > > > >> options:
> > > > >>>
> > > > >>> 1. Just explicitly use String.intern() - As of Java 7, there is no
> > > > >>> longer a separate, fixed-size PermGen space, so intern'd strings
> > will
> > > > >>> be in the main heap, no longer constrained to a limited size pool.
> > > > >>> These strings are still subject to garbage collection. It is
> > > > >>> implemented as a HashMap internally (native implementation), with a
> > > > >>> default bucket size of more than 60K, plenty big enough for the
> > > > >>> interning that TabletLocator is doing... but this is configurable
> > by
> > > > >>> the user with JVM flags if it's not. Interning will use less
> > memory as
> > > > >>> WeakHashMap and similar performance, as long as the bucket size is
> > big
> > > > >>> enough.
> > > > >>>
> > > > >>> 2. Just use -XX:+UseStringDeduplication JVM flag - as of Java 9,
> > G1 is
> > > > >>> the new default Java garbage collector. This garbage collector has
> > the
> > > > >>> option to automatically attempt to deduplicate all strings behind
> > the
> > > > >>> scenes, by swapping out their underlying char arrays (so, it likely
> > > > >>> won't affect == equality because the String object references
> > > > >>> themselves won't change, unlike option 1). This is more passive
> > than
> > > > >>> option 1, but would apply to the entire JVM. G1GC also implements
> > some
> > > > >>> heuristics to prevent too much overhead.
> > > > >>>
> > > > >>> With both options, it's possible to output statistics.
> > > > >>>
> > > > >>> If I remove the WeakHashMap for the string deduplication in
> > > > >>> TabletLocator, does anybody have an opinion on which option I
> > should
> > > > >>> replace it with? I'm leaning towards option 2 (adding it to
> > > > >>> assemble/conf/accumulo-env.sh as one of the default flags).
> > > > >>
> > > >
> > > >
> >

Reply via email to