[ 
http://issues.apache.org/jira/browse/HADOOP-331?page=comments#action_12418346 ] 

eric baldeschwieler commented on HADOOP-331:
--------------------------------------------

I think we should do this in an incremental fashion, so...

Each map task would continue to output its own state only.  Although 
centralizing this per node is an interesting idea.  I can see how this might be 
a real benefit in some cases.  Perhaps you should file this enhancement idea.

Ditto with the transmition to other nodes.  Interesting, but complicated idea.  
Maybe you should file it.  Think that can be taken up at a later date.  
Although feedback on how you would enhance this change to support such later 
work welcome.

A spill is dumping the buffered output to disk when we accumulate enough info.  
Yes, something like a 100mB buffer seems right.  (configurable possibly)

I think the goal should be that each reduce only reads a single range.  That 
will keep the client code simple an will keep us from thrashing as we scale.  
This may require some thought, since if you have a small number of reduce 
tasks, reading from multipule ranges may prove more seek efficient than doing 
the merge.

If we do block compression for intermediates, you would need to align those to 
reduce targets, but I don't think we should try to do that in the first version 
of this.  Especially given that this data will not be sorted the return to 
block compression may not be that great.  (yes, I can construct counter 
examples, but let's deal with this as a seperate project).  Checksums also need 
to be target aligned.

> map outputs should be written to a single output file with an index
> -------------------------------------------------------------------
>
>          Key: HADOOP-331
>          URL: http://issues.apache.org/jira/browse/HADOOP-331
>      Project: Hadoop
>         Type: Improvement

>   Components: mapred
>     Versions: 0.3.2
>     Reporter: eric baldeschwieler
>     Assignee: Yoram Arnon
>      Fix For: 0.5.0

>
> The current strategy of writing a file per target map is consuming a lot of 
> unused buffer space (causing out of memory crashes) and puts a lot of burden 
> on the FS (many opens, inodes used, etc).  
> I propose that we write a single file containing all output and also write an 
> index file IDing which byte range in the file goes to each reduce.  This will 
> remove the issue of buffer waste, address scaling issues with number of open 
> files and generally set us up better for scaling.  It will also have 
> advantages with very small inputs, since the buffer cache will reduce the 
> number of seeks needed and the data serving node can open a single file and 
> just keep it open rather than needing to do directory and open ops on every 
> request.
> The only issue I see is that in cases where the task output is substantiallyu 
> larger than its input, we may need to spill multiple times.  In this case, we 
> can do a merge after all spills are complete (or during the final spill).

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

Reply via email to