[
https://issues.apache.org/jira/browse/OOZIE-3232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16463321#comment-16463321
]
Misha Dmitriev commented on OOZIE-3232:
---------------------------------------
[~gezapeti] and [~andras.piros], here is one issue that I need your advice on.
There are many fields that need to be interned per the above analysis. If a
field can never be set to null, it's fine to replace e.g. {{this.path =
inPath;}} with {{this.path = inPath.intern();}}. However, if {{inPath}} can ever
be null, the above code will throw a NullPointerException. Hence in e.g. Hadoop
and Hive we have library methods that simply perform something like
{{return s != null ? s.intern() : null;}}. We use such methods everywhere
instead of a plain {{intern()}} call to avoid any real or potential NPEs.
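To make the difference concrete, here is a minimal sketch of such a null-safe helper used from a setter. The class and method names ({{NullSafeInternExample}}, {{PathHolder}}) are made up for illustration and are not existing Oozie, Hadoop or Hive code.

```java
// Hypothetical helper and bean; the names are illustrative, not actual Oozie code.
final class NullSafeInternExample {

    /** Returns the canonical (interned) copy of s, or null if s is null. */
    static String intern(String s) {
        return s != null ? s.intern() : null;
    }

    /** Minimal bean with one String field that may legitimately be null. */
    static final class PathHolder {
        private String path;

        void setPath(String inPath) {
            // A plain inPath.intern() would throw NullPointerException for null input.
            this.path = intern(inPath);
        }

        String getPath() {
            return path;
        }
    }

    public static void main(String[] args) {
        PathHolder a = new PathHolder();
        PathHolder b = new PathHolder();
        // new String(...) forces two distinct String objects with equal contents.
        a.setPath(new String("/user/oozie/wf"));
        b.setPath(new String("/user/oozie/wf"));
        // After interning, both fields reference the same String instance.
        System.out.println(a.getPath() == b.getPath());
        // A null input is stored as null instead of failing.
        a.setPath(null);
        System.out.println(a.getPath());
    }
}
```

With the helper, both holders end up sharing one String instance, and passing null simply stores null.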
I expected to be able to use
{{org.apache.hadoop.util.StringInterner.weakIntern()}}, which in recent
versions of Hadoop performs the proper call to {{String.intern()}} as above.
But it turns out that Oozie still depends on an old Hadoop version, where
{{weakIntern()}} uses an older implementation, essentially a custom interner
class based on WeakHashMap. In JDK 7 and newer, the built-in
{{String.intern()}} method performs much better than such custom interners,
and thus we should use it.
So basically we have a choice here between:
# Upgrading the Hadoop dependency in Oozie to a newer version. That may
introduce problems/incompatibilities somewhere (or may be long overdue :))
# Adding an Oozie-internal class, something like
{{org.apache.oozie.StringUtils}}, and stocking it with the {{intern(String)}}
method and maybe some others like {{internStringsInList(List<String>)}} etc.
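For the second option, a rough sketch of what such a utility class could look like is below. The class name {{StringUtilsSketch}} is a placeholder, and returning a new list from {{internStringsInList}} (rather than modifying the argument in place) is an assumption made here so that it also works with unmodifiable lists.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of an Oozie-internal utility class; not existing code.
final class StringUtilsSketch {

    private StringUtilsSketch() {
        // static helpers only
    }

    /** Null-safe wrapper around String.intern(). */
    static String intern(String s) {
        return s != null ? s.intern() : null;
    }

    /**
     * Returns a new list holding the interned version of every element.
     * Null elements are preserved; a null list yields null.
     */
    static List<String> internStringsInList(List<String> list) {
        if (list == null) {
            return null;
        }
        List<String> result = new ArrayList<>(list.size());
        for (String s : list) {
            result.add(intern(s));
        }
        return result;
    }
}
```

Call sites would then use e.g. {{this.path = StringUtilsSketch.intern(inPath);}} regardless of whether the field can be null.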
What would you suggest?
> Oozie may waste up to 25% of the heap due to duplicate strings
> --------------------------------------------------------------
>
> Key: OOZIE-3232
> URL: https://issues.apache.org/jira/browse/OOZIE-3232
> Project: Oozie
> Issue Type: Improvement
> Components: core
> Reporter: Misha Dmitriev
> Assignee: Misha Dmitriev
> Priority: Major
> Attachments: jxray-analysis-dup-strings.png
>
>
> I've recently analyzed an Oozie heap dump obtained from a customer's
> production job, using jxray (www.jxray.com). In this job, Oozie's heap
> consumption is ~6GB, and it turns out that nearly 25% of it is wasted due to
> a large number of duplicate strings. The screenshot below, taken from the
> jxray report, illustrates the problem.
> !jxray-analysis-dup-strings.png|width=638,height=669!
> It turns out that a lot of duplicate strings come from Oozie's own code,
> i.e. {{org.apache.oozie.*}}. In particular, the top data field wasting memory
> is {{org.apache.oozie.StringBlob.string}} (wastes 2.6%, or ~160MB). From the
> source code of {{StringBlob}} I see that it would be trivial to intern
> (deduplicate) these strings by adding the call to {{String.intern()}} in the
> constructor and a few other places. Similarly, various fields of
> {{org.apache.oozie.WorkflowJobBean}} and {{WorkflowActionBean}} collectively
> waste a lot of memory, and can be fixed in the same way.