[ https://issues.apache.org/jira/browse/MAPREDUCE-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799632#action_12799632 ]
Zheng Shao commented on MAPREDUCE-1374:
---------------------------------------

This experiment was done on hadoop-0.20. It shows JobClient memory usage when submitting a map-reduce job with around 200K mappers.

jmap before this patch (the JobClient hit an OOM before getting to the same stage as the second example):

{code}
 num     #instances         #bytes  class name
----------------------------------------------
   1:        188870       18107344  [C
   2:        242616        9704640  java.lang.String
   3:         42850        6543408  <constMethodKlass>
   4:         73218        5271696  org.apache.hadoop.hive.ql.io.HiveInputFormat$HiveInputSplit
   5:         42850        5151504  <methodKlass>
   6:          3570        4693192  <constantPoolKlass>
   7:         72077        3647360  <symbolKlass>
   8:         73307        3518736  org.apache.hadoop.mapred.FileSplit
   9:         75424        3075008  [Ljava.lang.String;
  10:          3570        2818968  <instanceKlassKlass>
  11:          2741        2524096  <constantPoolCacheKlass>
...
  14:         10069        1449936  java.net.URI
...
  23:         10065         241560  org.apache.hadoop.fs.Path
{code}

jmap after this patch:

{code}
 num     #instances         #bytes  class name
----------------------------------------------
   1:        199014       14329008  org.apache.hadoop.hive.ql.io.HiveInputFormat$HiveInputSplit
   2:        201801        9818856  [Ljava.lang.String;
   3:        199684        9584832  org.apache.hadoop.mapred.FileSplit
   4:         56594        8211632  [C
   5:         42851        6543872  <constMethodKlass>
   6:         42851        5151624  <methodKlass>
   7:          3570        4693616  <constantPoolKlass>
   8:         72091        3648368  <symbolKlass>
   9:          3570        2818968  <instanceKlassKlass>
  10:          2517        2675256  [Ljava.lang.Object;
  11:          4763        2531104  [I
  12:          2741        2524320  <constantPoolCacheKlass>
  13:         62275        2491000  java.lang.String
...
  31:           456          65664  java.net.URI
...
  69:           452          10848  org.apache.hadoop.fs.Path
{code}

String:FileSplit ratio:
  before this patch: 3.3 : 1
  after this patch:  0.3 : 1

We reduced the number of String objects per FileSplit by about 10 times!
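For reference, the ratios quoted above can be recomputed from the java.lang.String and org.apache.hadoop.mapred.FileSplit instance counts in the two histograms. The following is only a small arithmetic sketch; the class and method names (SplitRatio, ratio) are made up for illustration and are not part of the patch.

```java
import java.util.Locale;

public class SplitRatio {
    // Number of String instances per FileSplit instance.
    public static double ratio(long strings, long fileSplits) {
        return (double) strings / fileSplits;
    }

    public static void main(String[] args) {
        // Instance counts copied from the two jmap histograms.
        double before = ratio(242616, 73307);
        double after = ratio(62275, 199684);
        System.out.printf(Locale.ROOT, "before: %.1f : 1%n", before);
        System.out.printf(Locale.ROOT, "after:  %.1f : 1%n", after);
        System.out.printf(Locale.ROOT, "reduction factor: %.1f%n", before / after);
    }
}
```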
> Reduce memory footprint of FileSplit
> ------------------------------------
>
>                 Key: MAPREDUCE-1374
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1374
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 0.20.1, 0.21.0, 0.22.0
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>
> We can have many FileInput objects in memory, depending on the number of
> mappers.
> It will save tons of memory on the JobTracker and JobClient if we intern the
> Strings used for host names.
> {code}
> FileInputFormat.java:
>       for (NodeInfo host: hostList) {
>         // Strip out the port number from the host name
> -       retVal[index++] = host.node.getName().split(":")[0];
> +       retVal[index++] = host.node.getName().split(":")[0].intern();
>         if (index == replicationFactor) {
>           done = true;
>           break;
>         }
>       }
> {code}
> More on String.intern():
> http://www.javaworld.com/javaworld/javaqa/2003-12/01-qa-1212-intern.html
> It will also save a lot of memory to change the class of {{file}} from
> {{Path}} to {{String}}. {{Path}} contains a {{java.net.URI}}, which internally
> contains ~10 String fields, so this will also be a huge saving.
> {code}
>   private Path file;
> {code}
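To illustrate why interning the host names helps, here is a minimal standalone sketch (not Hadoop code). It assumes three hypothetical datanode names of the form host:port and builds the per-split host arrays with and without intern(), then counts how many distinct String objects are retained. The class InternHostsDemo and its methods are invented for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

public class InternHostsDemo {
    // Strip the port from a "host:port" name, as the FileInputFormat snippet does.
    public static String stripPort(String name) {
        return name.split(":")[0];
    }

    // Build one host array per split, optionally interning each host name.
    public static List<String[]> buildSplitHosts(String[] nodeNames, int splits, boolean intern) {
        List<String[]> all = new ArrayList<String[]>();
        for (int s = 0; s < splits; s++) {
            String[] hosts = new String[nodeNames.length];
            for (int i = 0; i < nodeNames.length; i++) {
                String host = stripPort(nodeNames[i]);
                hosts[i] = intern ? host.intern() : host;
            }
            all.add(hosts);
        }
        return all;
    }

    // Count String objects by identity (==), not by equals().
    public static int countDistinctObjects(List<String[]> splitHosts) {
        Set<String> seen = Collections.newSetFromMap(new IdentityHashMap<String, Boolean>());
        for (String[] hosts : splitHosts) {
            Collections.addAll(seen, hosts);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // Hypothetical datanode names; real ones come from the block locations.
        String[] nodes = {"host1:50010", "host2:50010", "host3:50010"};
        System.out.println("without intern: "
            + countDistinctObjects(buildSplitHosts(nodes, 1000, false)));
        System.out.println("with intern:    "
            + countDistinctObjects(buildSplitHosts(nodes, 1000, true)));
    }
}
```

Without interning, every split holds its own copy of each host name, so the number of String objects grows with the number of splits. With interning, all splits share one String per distinct host.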