[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799632#action_12799632
 ] 

Zheng Shao commented on MAPREDUCE-1374:
---------------------------------------

This experiment was run on hadoop-0.20. It shows JobClient memory usage when 
submitting a map-reduce job with around 200K mappers:

jmap before this patch (the client ran out of memory before reaching the same 
stage as the second example):
{code}
 num     #instances         #bytes  class name
----------------------------------------------
   1:        188870       18107344  [C
   2:        242616        9704640  java.lang.String
   3:         42850        6543408  <constMethodKlass>
   4:         73218        5271696  org.apache.hadoop.hive.ql.io.HiveInputFormat$HiveInputSplit
   5:         42850        5151504  <methodKlass>
   6:          3570        4693192  <constantPoolKlass>
   7:         72077        3647360  <symbolKlass>
   8:         73307        3518736  org.apache.hadoop.mapred.FileSplit
   9:         75424        3075008  [Ljava.lang.String;
  10:          3570        2818968  <instanceKlassKlass>
  11:          2741        2524096  <constantPoolCacheKlass>
...
  14:         10069        1449936  java.net.URI
...
  23:         10065         241560  org.apache.hadoop.fs.Path
{code}


jmap after this patch:
{code}
 num     #instances         #bytes  class name
----------------------------------------------
   1:        199014       14329008  org.apache.hadoop.hive.ql.io.HiveInputFormat$HiveInputSplit
   2:        201801        9818856  [Ljava.lang.String;
   3:        199684        9584832  org.apache.hadoop.mapred.FileSplit
   4:         56594        8211632  [C
   5:         42851        6543872  <constMethodKlass>
   6:         42851        5151624  <methodKlass>
   7:          3570        4693616  <constantPoolKlass>
   8:         72091        3648368  <symbolKlass>
   9:          3570        2818968  <instanceKlassKlass>
  10:          2517        2675256  [Ljava.lang.Object;
  11:          4763        2531104  [I
  12:          2741        2524320  <constantPoolCacheKlass>
  13:         62275        2491000  java.lang.String
...
  31:           456          65664  java.net.URI
...
  69:           452          10848  org.apache.hadoop.fs.Path
{code}


String:FileSplit ratio:
before this patch: 3.3 : 1
after this patch: 0.3 : 1

We reduced the number of String objects per FileSplit by about 10 times!
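
For reference, the ratios can be recomputed from the java.lang.String and org.apache.hadoop.mapred.FileSplit instance counts in the two histograms. This is only an arithmetic sketch; the class and method names (SplitRatio, ratio) are made up for illustration and are not part of the patch.

```java
public class SplitRatio {
    // Number of String instances per FileSplit instance.
    static double ratio(long strings, long fileSplits) {
        return (double) strings / fileSplits;
    }

    public static void main(String[] args) {
        // Instance counts copied from the jmap histograms in this comment.
        double before = ratio(242616, 73307);   // java.lang.String : FileSplit, before the patch
        double after = ratio(62275, 199684);    // java.lang.String : FileSplit, after the patch

        System.out.println("before: " + before + " : 1");
        System.out.println("after:  " + after + " : 1");
        System.out.println("reduction factor: " + (before / after));
    }
}
```

Note that the "before" histogram was taken at an earlier stage (fewer FileSplit instances), which is why the comparison uses a per-FileSplit ratio rather than raw String counts.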


> Reduce memory footprint of FileSplit
> ------------------------------------
>
>                 Key: MAPREDUCE-1374
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1374
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 0.20.1, 0.21.0, 0.22.0
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>
> We can have many FileSplit objects in memory, depending on the number of 
> mappers.
> It will save a lot of memory on the JobTracker and JobClient if we intern the 
> Strings for host names.
> {code}
> FileInputFormat.java:
>       for (NodeInfo host: hostList) {
>         // Strip out the port number from the host name
> -        retVal[index++] = host.node.getName().split(":")[0];
> +        retVal[index++] = host.node.getName().split(":")[0].intern();
>         if (index == replicationFactor) {
>           done = true;
>           break;
>         }
>       }
> {code}
> More on String.intern(): 
> http://www.javaworld.com/javaworld/javaqa/2003-12/01-qa-1212-intern.html
> Changing the class of {{file}} from {{Path}} to {{String}} will also save a 
> lot of memory. {{Path}} contains a {{java.net.URI}}, which internally 
> contains ~10 String fields.
> {code}
>   private Path file;
> {code}
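
The effect of the intern() change quoted above can be seen in a small standalone sketch. It does not use any Hadoop classes: the host names, the port 50010, and the helper names (InternHosts, stripPort, hostsFor, distinctObjects) are all made up for illustration. It counts distinct String objects by identity, which is roughly what the jmap instance counts reflect.

```java
import java.util.IdentityHashMap;
import java.util.Map;

public class InternHosts {
    // Mirrors the patched line: strip the port, then optionally intern the host name.
    static String stripPort(String nodeName, boolean intern) {
        String host = nodeName.split(":")[0];
        return intern ? host.intern() : host;
    }

    // Builds one host-name String per split, cycling over a small set of nodes.
    static String[] hostsFor(int splits, int nodes, boolean intern) {
        String[] out = new String[splits];
        for (int i = 0; i < splits; i++) {
            // Hypothetical node name of the form host:port.
            out[i] = stripPort("node" + (i % nodes) + ".example.com:50010", intern);
        }
        return out;
    }

    // Counts distinct String objects by reference identity, not by equals().
    static int distinctObjects(String[] hosts) {
        Map<String, Boolean> seen = new IdentityHashMap<>();
        for (String h : hosts) {
            seen.put(h, Boolean.TRUE);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        int splits = 10000;
        int nodes = 20;
        System.out.println("without intern: " + distinctObjects(hostsFor(splits, nodes, false)));
        System.out.println("with intern:    " + distinctObjects(hostsFor(splits, nodes, true)));
    }
}
```

Without interning, every split holds its own copy of each host name; with interning, the number of host-name objects is bounded by the number of distinct hosts rather than the number of splits.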

