[ https://issues.apache.org/jira/browse/HADOOP-1027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Devaraj Das updated HADOOP-1027:
--------------------------------

    Attachment: 1027-new4.patch

This hopefully addresses the comment about coding conventions.

> Fix the RAM FileSystem/Merge problems (reported in HADOOP-1014)
> ---------------------------------------------------------------
>
>                 Key: HADOOP-1027
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1027
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Devaraj Das
>         Assigned To: Devaraj Das
>         Attachments: 1027-new.patch, 1027-new2.patch, 1027-new3.patch, 1027-new4.patch
>
> 1) The merge algorithm implementation does not delete empty segments (sequence files with no key/value data) when no single-level merge is performed on those segments (because the check "numberOfSegmentsRemaining <= factor" returns true). This affected the in-memory merge in a subtle way: the merge spill file is given the same name as the 0th entry file in the ramfs. If that file was empty, it would not be deleted from the ramfs, and if a subsequent merge on the ramfs chose the same name for its spill file, it would overwrite the previously created spill. This led to the inconsistent output sizes.
> 2) InMemoryFileSystem has a close() method that is not protected; it is the only method that modifies the pathToFileAttribs map without first locking the InMemoryFileSystem instance. This quite likely leads to a ConcurrentModificationException if one thread calls InMemoryFileSystem.close() (because of some exception) while another thread is in InMemoryFileSystem.getFiles(). This problem does not affect the correctness of the merge (the task is going to fail anyway); the more important point is that some other exception, such as running out of disk space so that map outputs could not be written, happened first and may be unrelated to the merge process at all.
> 3) The number of outputs merged at once in RAM should be limited, to prevent OutOfMemory errors. Consider a job with tens of thousands of maps, all of which generate empty outputs. With the default RAM FS size of 75 MB we could accommodate a very large number of map outputs in RAM without doing any merge, but the various other bookkeeping data structures would then explode in size. A trade-off is needed here, especially because the in-memory merging is done in the TaskTracker process, which is already under a good amount of memory pressure.
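A minimal sketch of the race described in item 2, using a plain HashMap in place of pathToFileAttribs; the class and method bodies below are made up for illustration and are not the actual InMemoryFileSystem code. If close() clears the map without taking the lock that getFiles() holds while iterating, the iterating thread can hit a ConcurrentModificationException; synchronizing close() on the same instance removes the race.

{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only; not the Hadoop InMemoryFileSystem code.
class RamFsSketch {
  // Stands in for pathToFileAttribs; a plain HashMap, so every access
  // must happen under the same lock.
  private final Map<String, Long> pathToFileAttribs = new HashMap<String, Long>();

  // Analogous to getFiles(): iterates the map while holding the instance lock.
  public synchronized List<String> getFiles() {
    List<String> files = new ArrayList<String>();
    for (Map.Entry<String, Long> e : pathToFileAttribs.entrySet()) {
      files.add(e.getKey());
    }
    return files;
  }

  // Buggy shape: clears the map without the instance lock, so it can run
  // while another thread is inside getFiles() and invalidate its iterator,
  // which then throws ConcurrentModificationException.
  public void closeUnsafe() {
    pathToFileAttribs.clear();
  }

  // Fixed shape: takes the same lock as getFiles(), so the iteration and
  // the clear can never interleave.
  public synchronized void close() {
    pathToFileAttribs.clear();
  }
}
{code}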
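Item 3 amounts to triggering the in-memory merge on a file-count threshold in addition to the ramfs size. The sketch below only illustrates that kind of policy check; the class, field, and constant names (MAX_INMEM_FILES and so on) are hypothetical and not the identifiers used in the patch.

{code:java}
// Hypothetical sketch of the trade-off in item 3; not code from the patch.
class InMemMergePolicySketch {
  // Made-up limits: cap the number of accumulated outputs as well as the
  // fraction of the RAM FS that may fill up before a merge is forced.
  private static final int MAX_INMEM_FILES = 1000;
  private static final float MEM_FULL_FRACTION = 0.66f;

  private final long ramFsCapacityBytes;
  private long bytesInRam = 0;
  private int numInMemoryOutputs = 0;

  InMemMergePolicySketch(long ramFsCapacityBytes) {
    this.ramFsCapacityBytes = ramFsCapacityBytes;
  }

  // Called after each map output has been written into the RAM FS.
  void outputAdded(long sizeBytes) {
    bytesInRam += sizeBytes;
    numInMemoryOutputs++;
  }

  // Without the file-count check, tens of thousands of empty outputs could
  // accumulate in a 75 MB RAM FS while the per-file bookkeeping keeps growing.
  boolean shouldMergeNow() {
    return numInMemoryOutputs >= MAX_INMEM_FILES
        || bytesInRam >= (long) (ramFsCapacityBytes * MEM_FULL_FRACTION);
  }

  // Called once the accumulated outputs have been merged and spilled to disk.
  void mergedAndSpilled() {
    bytesInRam = 0;
    numInMemoryOutputs = 0;
  }
}
{code}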