[
https://issues.apache.org/jira/browse/HADOOP-1027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12475605
]
David Bowen commented on HADOOP-1027:
-------------------------------------
> This hopefully addresses the coding convention related comment.
Indeed it does. Thank you.
> Fix the RAM FileSystem/Merge problems (reported in HADOOP-1014)
> ---------------------------------------------------------------
>
> Key: HADOOP-1027
> URL: https://issues.apache.org/jira/browse/HADOOP-1027
> Project: Hadoop
> Issue Type: Bug
> Components: mapred
> Reporter: Devaraj Das
> Assigned To: Devaraj Das
> Attachments: 1027-new.patch, 1027-new2.patch, 1027-new3.patch,
> 1027-new4.patch
>
>
> 1) The merge algorithm does not delete empty segments (sequence files with no
> key/value data) when no single-level merge is performed on them (because the
> check "numberOfSegmentsRemaining <= factor" returns true). This affected the
> in-memory merge in a subtle way: the merge spill file is given the same name
> as the 0th entry file in the ramfs. If that file happened to be empty, it was
> never deleted from the ramfs, and if a subsequent merge on the ramfs chose the
> same name for its spill file, it overwrote the previously created spill. This
> led to the inconsistent output sizes. (A cleanup of empty segments is sketched
> after this list.)
> 2) InMemoryFileSystem has a "close" method that is not synchronized; it is the
> only method that modifies the pathToFileAttribs map without first locking the
> InMemoryFileSystem instance. That quite likely leads to a
> ConcurrentModificationException if one thread calls InMemoryFileSystem.close
> (because of some exception) while another thread is in the middle of
> InMemoryFileSystem.getFiles(). This problem does not affect the correctness of
> the merge (the task is going to fail anyway); the more important point is that
> some other exception occurred first (e.g. insufficient disk space, so map
> outputs could not be written), which may be unrelated to the merge altogether.
> (The locking fix is sketched after this list.)
> 3) The number of outputs merged at once in RAM should be limited, to prevent
> OutOfMemory errors. Consider a case with tens of thousands of maps that all
> generate empty outputs. With the default RAM FS size of 75 MB we can
> accommodate a very large number of map outputs in RAM without doing any merge,
> but the various other data structures then explode in size. We have to make a
> trade-off here, especially because the in-memory merging is done in the
> TaskTracker process, which is already under a good amount of memory pressure.
> (A possible cap is sketched after this list.)
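>
> For (1), a minimal Java sketch of the kind of cleanup intended, assuming a
> hypothetical Segment type (dataLength, deleteBackingFile and
> pruneEmptySegments are illustrative names, not the actual patch):
>
>   import java.io.IOException;
>   import java.util.Iterator;
>   import java.util.List;
>
>   class EmptySegmentCleanup {
>     // Minimal stand-in for a merge segment backed by a file in the ramfs.
>     interface Segment {
>       long dataLength() throws IOException;   // bytes of key/value data
>       void deleteBackingFile() throws IOException;
>     }
>
>     // Called when "numberOfSegmentsRemaining <= factor" holds, i.e. no
>     // further single-level merge will touch these segments.
>     static void pruneEmptySegments(List<Segment> segments) throws IOException {
>       for (Iterator<Segment> it = segments.iterator(); it.hasNext(); ) {
>         Segment s = it.next();
>         if (s.dataLength() == 0) {
>           s.deleteBackingFile();   // remove the empty file from the ramfs
>           it.remove();             // and drop it from the merge's segment list
>         }
>       }
>     }
>   }
>
> With the empty files gone, a later in-memory merge can no longer pick a spill
> name that collides with a leftover empty segment.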
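>
> For (2), a sketch of the locking that avoids the race; the field and method
> names follow the description above, but the class body is illustrative only:
>
>   import java.util.ArrayList;
>   import java.util.HashMap;
>   import java.util.List;
>   import java.util.Map;
>
>   class InMemFsSketch {
>     // Map of path -> file attributes, guarded by the instance lock.
>     private final Map<String, Object> pathToFileAttribs =
>       new HashMap<String, Object>();
>
>     // Reader: iterates the map while holding the instance lock.
>     public synchronized List<String> getFiles() {
>       return new ArrayList<String>(pathToFileAttribs.keySet());
>     }
>
>     // Before the fix, close() cleared the map without taking the instance
>     // lock, racing with getFiles(); making it synchronized closes that window.
>     public synchronized void close() {
>       pathToFileAttribs.clear();
>     }
>   }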
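>
> For (3), a sketch of a cap on the number of in-memory outputs; the constants
> and names are illustrative (only the 75 MB default comes from the
> description):
>
>   class InMemMergeTrigger {
>     private static final long  RAMFS_BYTES_LIMIT = 75L * 1024 * 1024; // 75 MB
>     private static final float FULLNESS_TRIGGER  = 0.66f; // illustrative
>     private static final int   MAX_INMEM_OUTPUTS = 1000;  // illustrative cap
>
>     private long bytesInRam   = 0;
>     private int  outputsInRam = 0;
>
>     // Called after each map output has been written into the ramfs: merge
>     // when the ramfs is nearly full OR too many (possibly empty) outputs
>     // have accumulated, so bookkeeping structures cannot grow without bound.
>     boolean shouldMergeNow(long newOutputBytes) {
>       bytesInRam += newOutputBytes;
>       outputsInRam++;
>       return bytesInRam >= (long) (RAMFS_BYTES_LIMIT * FULLNESS_TRIGGER)
>           || outputsInRam >= MAX_INMEM_OUTPUTS;
>     }
>
>     // Reset the counters once a merge has drained the ramfs.
>     void onMergeDone() {
>       bytesInRam = 0;
>       outputsInRam = 0;
>     }
>   }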
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.