[ 
http://issues.apache.org/jira/browse/HADOOP-331?page=comments#action_12442793 ] 
            
Devaraj Das commented on HADOOP-331:
------------------------------------

I would ideally like to use as much of the existing code/tools as possible. So, 
here is another proposal:
1) Generate a master index as the buffers are filled with key/value pairs. The 
master index is an array indexed by partition number and each entry in the 
array contain values of the form <filename, keyoffset, valoffset>*.  The 
filename refers to the filename where the containing buffer was spilled. Think 
of this as each element in the array is an array by itself.
2) Filled buffers are spilled and NO sorting is done when a spill happens.
3) At the end, we have a master index and we go through that index by index ( 
partition by partition). Since we know the filename and the key/value offsets, 
we can access the key/value data with ease. We create a temporary fixed-sized 
in-memory buffer for each array element and if we can accomodate all the 
key/value pairs for the particular array element in that buffer, well and good; 
we sort that in-memory buffer and append the sorted data to a final output 
file. If we cannot fit in everything into the buffer, we spill and sort the 
spills after we spill all the key/value pairs.
4) While we do (3), we can trivially generate the final index file.
5) At the end of the step (3), we have a big file with sorted data (by 
partition) and a single index file. 
The problems of implicit partition numbers are not there with this approach. 
Also, we can do parallel sorts (like two at a time or something) of the array 
elements and generate multiple files and just concatenate the sorted data at 
the end of the sorts to that single output file. The negative with this 
approach is that we need another temp buffer (but it may be required in the 
other approaches also). Think the master index file won't be very big (it will 
be roughly of the order of number of records processed by a single map).
Thoughts ?

> map outputs should be written to a single output file with an index
> -------------------------------------------------------------------
>
>                 Key: HADOOP-331
>                 URL: http://issues.apache.org/jira/browse/HADOOP-331
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.3.2
>            Reporter: eric baldeschwieler
>         Assigned To: Devaraj Das
>
> The current strategy of writing a file per target map is consuming a lot of 
> unused buffer space (causing out of memory crashes) and puts a lot of burden 
> on the FS (many opens, inodes used, etc).  
> I propose that we write a single file containing all output and also write an 
> index file IDing which byte range in the file goes to each reduce.  This will 
> remove the issue of buffer waste, address scaling issues with number of open 
> files and generally set us up better for scaling.  It will also have 
> advantages with very small inputs, since the buffer cache will reduce the 
> number of seeks needed and the data serving node can open a single file and 
> just keep it open rather than needing to do directory and open ops on every 
> request.
> The only issue I see is that in cases where the task output is substantiallyu 
> larger than its input, we may need to spill multiple times.  In this case, we 
> can do a merge after all spills are complete (or during the final spill).

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to