[
https://issues.apache.org/jira/browse/MAPREDUCE-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris Douglas updated MAPREDUCE-1374:
-------------------------------------
Status: Open (was: Patch Available)
* The unit test mixes JUnit3 and JUnit4. Instead of extending {{TestCase}},
statically import the asserts so the test is consistently JUnit4.
* I agree with Todd/Amar/Tom on using a {{WeakHashMap}} instead of
{{String::intern}} for the hosts. The guarantees offered by the latter are much
stronger than what is required to support this case.
* Using {{String::intern}} for the input path is taking a good idea too far;
for long-running clients submitting many jobs, the cache footprint could be
excessive. Further, if the file is splittable, creating several splits with the
same (immutable) {{Path}} reference is pretty cheap. The space savings effected
by making this member a {{String}} do not seem very compelling.
* If your tests suggest that caching input paths is important, then keeping a
{{WeakHashMap<Path,String>}} would avoid the overhead of {{URI::toString}} and
the temporary objects it creates (as opposed to computing the result and then
looking it up in the cache).
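For illustration, the {{WeakHashMap}} approach for the host names might look roughly like the sketch below. It is not taken from any of the attached patches; {{HostNameCache}} and {{canonicalize}} are hypothetical names, and the sketch assumes simple coarse-grained synchronization is acceptable.

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

/**
 * Hypothetical helper (not part of Hadoop): a weak canonicalizing cache
 * for host name strings, as an alternative to String.intern().
 */
final class HostNameCache {
  // The value must be a WeakReference. A strong reference to the same
  // String that serves as the key would keep the entry alive forever.
  private final Map<String, WeakReference<String>> cache =
      new WeakHashMap<String, WeakReference<String>>();

  /** Returns one shared instance for all equal host names seen so far. */
  synchronized String canonicalize(String host) {
    WeakReference<String> ref = cache.get(host);
    String cached = (ref == null) ? null : ref.get();
    if (cached != null) {
      return cached;
    }
    // First time this value is seen (or the old copy was collected):
    // the argument itself becomes the canonical instance.
    cache.put(host, new WeakReference<String>(host));
    return host;
  }
}
```

Unlike {{String::intern}}, entries here become eligible for collection once no split references the canonical string, so a long-running client does not accumulate host names indefinitely.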
> Reduce memory footprint of FileSplit
> ------------------------------------
>
> Key: MAPREDUCE-1374
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1374
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Affects Versions: 0.20.1, 0.21.0, 0.22.0
> Reporter: Zheng Shao
> Assignee: Zheng Shao
> Fix For: 0.21.0, 0.22.0
>
> Attachments: MAPREDUCE-1374.1.patch, MAPREDUCE-1374.2.patch,
> MAPREDUCE-1374.3.patch
>
>
> We can have many FileInput objects in memory, depending on the number of
> mappers.
> It will save tons of memory on JobTracker and JobClient if we intern those
> Strings for host names.
> {code}
> FileInputFormat.java:
>     for (NodeInfo host: hostList) {
>       // Strip out the port number from the host name
> -     retVal[index++] = host.node.getName().split(":")[0];
> +     retVal[index++] = host.node.getName().split(":")[0].intern();
>       if (index == replicationFactor) {
>         done = true;
>         break;
>       }
>     }
> {code}
> More on String.intern():
> http://www.javaworld.com/javaworld/javaqa/2003-12/01-qa-1212-intern.html
> It will also save a lot of memory by changing the class of {{file}} from
> {{Path}} to {{String}}. {{Path}} contains a {{java.net.URI}} which internally
> contains ~10 String fields. This will also be a huge saving.
> {code}
> private Path file;
> {code}