[ 
https://issues.apache.org/jira/browse/HADOOP-5494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Devaraj Das updated HADOOP-5494:
--------------------------------

    Attachment: 5494-1.patch

This patch (an early one that still needs large-scale testing) does the 
following:
1) Removes the method next(DataInputBuffer, DataInputBuffer) from the 
Merger.Segment class and the IFile.Reader classes.
2) Defines nextRawKey and nextRawValue in those classes.
3) Calls nextRawValue in Merger.MergeQueue.next(), so the DataInputBuffer 
passed in is allocated memory only at that point. This holds for the 
IFile.Reader implementation; in the IFile.InMemoryReader case the value is 
already in memory.
4) Removes IFile's own buffering, which was layered on top of the 
FileSystem's buffering. The FileSystem-level buffering should be sufficient.
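
To illustrate the key/value split in points 2) and 3), here is a minimal, 
self-contained sketch. It is not the code in the patch: RawRecordReader, its 
encode helper, the fixed-width int record layout, the -1 end marker, and the 
choice to skip an unread value on the next nextRawKey() call are all 
assumptions made for illustration only.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;

// Hypothetical, simplified reader; NOT the real IFile.Reader.
// Assumed record layout: [int keyLen][int valLen][key bytes][value bytes],
// terminated by a key length of -1.
class RawRecordReader {
    private final DataInputStream in;
    private int pendingValueLength = -1; // -1 means no value is waiting to be read

    RawRecordReader(DataInputStream in) {
        this.in = in;
    }

    // Reads the record header and the key only. Returns null at the end marker.
    byte[] nextRawKey() throws IOException {
        if (pendingValueLength >= 0) {
            // The previous value was never requested: skip it without buffering it.
            skipFully(pendingValueLength);
            pendingValueLength = -1;
        }
        int keyLength = in.readInt();
        if (keyLength < 0) {
            return null;
        }
        pendingValueLength = in.readInt();
        byte[] key = new byte[keyLength];
        in.readFully(key);
        return key;
    }

    // Allocates and fills the value buffer only now, when the caller needs it.
    byte[] nextRawValue() throws IOException {
        if (pendingValueLength < 0) {
            throw new IllegalStateException("call nextRawKey() first");
        }
        byte[] value = new byte[pendingValueLength];
        in.readFully(value);
        pendingValueLength = -1;
        return value;
    }

    private void skipFully(int n) throws IOException {
        while (n > 0) {
            int skipped = in.skipBytes(n);
            if (skipped <= 0) {
                throw new EOFException("stream ended inside a value");
            }
            n -= skipped;
        }
    }

    // Test helper: encodes records in the assumed layout.
    static byte[] encode(byte[][] keys, byte[][] values) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        for (int i = 0; i < keys.length; i++) {
            out.writeInt(keys[i].length);
            out.writeInt(values[i].length);
            out.write(keys[i]);
            out.write(values[i]);
        }
        out.writeInt(-1); // end marker
        out.flush();
        return bytes.toByteArray();
    }
}
```

In a merge, only the segment whose key wins the comparison would call 
nextRawValue(), so at most one value buffer needs to be live at a time.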

Another option is to have a single _next_ method that takes a _key 
DataInputBuffer_ and returns two things: the filled _key DataInputBuffer_ 
and a _stream_ for the value (a DataInput), without allocating memory up 
front for the value (again, true only for the IFile.Reader case). But that 
can probably be a follow-up JIRA.
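
A rough sketch of what such a stream-returning _next_ could look like is 
below. Everything in it is hypothetical: StreamingRecordReader, the use of a 
ByteArrayOutputStream as the key buffer, the fixed-width int record layout, 
and the -1 end marker are assumptions for illustration and are not taken 
from the patch or from IFile.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch; NOT part of IFile or of the attached patch.
// Assumed record layout: [int keyLen][int valLen][key bytes][value bytes],
// terminated by a key length of -1.
class StreamingRecordReader {
    private final DataInputStream in;
    private int unreadValueBytes = 0; // bytes of the current value not yet consumed

    StreamingRecordReader(DataInputStream in) {
        this.in = in;
    }

    // Fills keyOut with the next key and returns a stream over the value,
    // or returns null at the end marker. The returned stream is only valid
    // until the next call to next().
    InputStream next(ByteArrayOutputStream keyOut) throws IOException {
        // Discard whatever the caller left unread of the previous value.
        while (unreadValueBytes > 0) {
            int skipped = in.skipBytes(unreadValueBytes);
            if (skipped <= 0) {
                throw new EOFException("stream ended inside a value");
            }
            unreadValueBytes -= skipped;
        }
        int keyLength = in.readInt();
        if (keyLength < 0) {
            return null;
        }
        int valueLength = in.readInt();
        byte[] key = new byte[keyLength];
        in.readFully(key);
        keyOut.reset();
        keyOut.write(key, 0, key.length);
        unreadValueBytes = valueLength;

        // Bounded view over the underlying stream: no value buffer is allocated here.
        return new InputStream() {
            @Override
            public int read() throws IOException {
                if (unreadValueBytes == 0) {
                    return -1;
                }
                int b = in.read();
                if (b < 0) {
                    throw new EOFException("stream ended inside a value");
                }
                unreadValueBytes--;
                return b;
            }

            @Override
            public int read(byte[] buf, int off, int len) throws IOException {
                if (len == 0) {
                    return 0;
                }
                if (unreadValueBytes == 0) {
                    return -1;
                }
                int n = in.read(buf, off, Math.min(len, unreadValueBytes));
                if (n < 0) {
                    throw new EOFException("stream ended inside a value");
                }
                unreadValueBytes -= n;
                return n;
            }
        };
    }
}
```

A caller that needs a DataInput can wrap the returned stream in a 
DataInputStream. The trade-off in this sketch is that the value must be 
consumed before the reader advances, since the stream shares the reader's 
position.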

> IFile.Reader should have a nextRawKey/nextRawValue
> --------------------------------------------------
>
>                 Key: HADOOP-5494
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5494
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.18.0
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.21.0
>
>         Attachments: 5494-1.patch
>
>
> Merger.Segment defines only the next() method, which internally calls 
> next(key,value) on the underlying IFile stream and therefore reads both the 
> key and the value bytes. It would be good to have 
> Merger.Segment.nextRawKey(), which would read only the key and delay 
> reading the value until it is needed (in Merger.MergeQueue.next()) via a new 
> method, Merger.Segment.nextRawValue(). We would then load the bytes of only 
> one value at a time, which could reduce the memory footprint considerably, 
> depending on how big the values are.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
