[ 
https://issues.apache.org/jira/browse/OOZIE-3232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16463321#comment-16463321
 ] 

Misha Dmitriev commented on OOZIE-3232:
---------------------------------------

[~gezapeti] and [~andras.piros], here is one issue that I need your advice on.

There are many fields that need to be interned per the above analysis. If a 
field can never be null, it's fine to replace e.g. {{this.path = inPath;}} 
with {{this.path = inPath.intern();}}. However, if {{inPath}} can ever be 
null, the above code will throw a NullPointerException. Hence in e.g. Hadoop 
and Hive we have library methods that simply perform something like 
{{return s != null ? s.intern() : null;}}. We use such methods everywhere 
instead of a plain {{intern()}} call to avoid any real or potential NPEs.
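
A minimal sketch of that null-safe pattern applied to a field assignment. The class {{PathHolder}} and the helper name {{internOrNull}} are hypothetical, chosen only for illustration; they are not actual Oozie, Hadoop, or Hive names.

```java
// Hypothetical bean used only for illustration; not an actual Oozie class.
final class PathHolder {
    private String path;

    // Null-safe interning: returns the canonical copy of s, or null if s is null.
    static String internOrNull(String s) {
        return s != null ? s.intern() : null;
    }

    void setPath(String inPath) {
        // A plain inPath.intern() here would throw NullPointerException
        // whenever inPath is null.
        this.path = internOrNull(inPath);
    }

    String getPath() {
        return path;
    }
}
```

With this in place, two holders that are given equal but distinct String objects end up referencing the same instance, and a null input is stored as null rather than failing.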

I expected to be able to use 
{{org.apache.hadoop.util.StringInterner.weakIntern()}}, which in recent 
versions of Hadoop performs the proper null-safe call to {{String.intern()}} 
as above. But it turns out that Oozie still depends on an old Hadoop version, 
where {{weakIntern()}} uses an older implementation, essentially a custom 
interner class based on WeakHashMap. In JDK 7 and newer, the built-in 
{{String.intern()}} method performs much better than such custom interners, 
and thus we should use it.

So basically here we have a choice of either:
 # Upgrade the Hadoop dependency in Oozie to a newer version. That may 
introduce problems/incompatibilities somewhere (or may be long overdue :))
 # Add an Oozie-internal class, something like 
{{org.apache.oozie.StringUtils}}, and stock it with {{intern(String)}} and 
maybe some other methods like {{internStringsInList(List<String>)}} etc.
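
For the second option, a rough sketch of what such a utility class could look like. Only the class and method names come from the proposal above; the return types and the choice to return a new list rather than modify the input in place are assumptions, and the package declaration is omitted to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed Oozie-internal helper class.
final class StringUtils {
    private StringUtils() {
        // Static utility class; not meant to be instantiated.
    }

    // Returns the canonical (interned) copy of s, or null if s is null.
    static String intern(String s) {
        return s != null ? s.intern() : null;
    }

    // Returns a new list holding the interned elements of the input.
    // Assumption: a null list is passed through as null, and null
    // elements are preserved as null.
    static List<String> internStringsInList(List<String> list) {
        if (list == null) {
            return null;
        }
        List<String> result = new ArrayList<>(list.size());
        for (String s : list) {
            result.add(intern(s));
        }
        return result;
    }
}
```

Returning a new list avoids failures when the caller passes an unmodifiable list; an in-place variant would save one allocation but would need the input to be mutable.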

What would you suggest?

> Oozie may waste up to 25% of the heap due to duplicate strings
> --------------------------------------------------------------
>
>                 Key: OOZIE-3232
>                 URL: https://issues.apache.org/jira/browse/OOZIE-3232
>             Project: Oozie
>          Issue Type: Improvement
>          Components: core
>            Reporter: Misha Dmitriev
>            Assignee: Misha Dmitriev
>            Priority: Major
>         Attachments: jxray-analysis-dup-strings.png
>
>
> I've recently analyzed an Oozie heap dump obtained from a customer's 
> production job, using jxray (www.jxray.com). In this job, Oozie's heap 
> consumption is ~6GB, and it turns out that nearly 25% of it is wasted due to 
> a large number of duplicate strings. The screenshot below, taken from the 
> jxray report, illustrates the problem.
> !jxray-analysis-dup-strings.png|width=638,height=669!
> It turns out that a lot of duplicate strings come from Oozie's own code, 
> i.e. {{org.apache.oozie.*}}. In particular, the top data field wasting memory 
> is {{org.apache.oozie.StringBlob.string}} (wastes 2.6%, or ~160MB). From the 
> source code of {{StringBlob}} I see that it would be trivial to intern 
> (deduplicate) these strings by adding the call to {{String.intern()}} in the 
> constructor and a few other places. Similarly, various fields of 
> {{org.apache.oozie.WorkflowJobBean}} and {{WorkflowActionBean}} collectively 
> waste a lot of memory, and can be fixed in the same way.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
