[ http://issues.apache.org/jira/browse/HADOOP-331?page=comments#action_12418346 ]
eric baldeschwieler commented on HADOOP-331:
--------------------------------------------

I think we should do this in an incremental fashion, so...

Each map task would continue to output its own state only. Although centralizing this per node is an interesting idea. I can see how this might be a real benefit in some cases. Perhaps you should file this enhancement idea.

Ditto with the transmission to other nodes. Interesting, but complicated idea. Maybe you should file it. I think that can be taken up at a later date, although feedback on how you would enhance this change to support such later work is welcome.

A spill is dumping the buffered output to disk when we accumulate enough info. Yes, something like a 100 MB buffer seems right (configurable, possibly).

I think the goal should be that each reduce only reads a single range. That will keep the client code simple and will keep us from thrashing as we scale. This may require some thought, since if you have a small number of reduce tasks, reading from multiple ranges may prove more seek-efficient than doing the merge.

If we do block compression for intermediates, you would need to align those to reduce targets, but I don't think we should try to do that in the first version of this. Especially given that this data will not be sorted, the return on block compression may not be that great. (Yes, I can construct counterexamples, but let's deal with this as a separate project.) Checksums also need to be target-aligned.
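The single-file-plus-index layout under discussion can be sketched as below. This is a minimal illustration of the idea, not Hadoop's actual implementation; the class and method names are hypothetical. All partitions' records go into one stream, and the index records the byte range owned by each reduce target, so each reduce reads exactly one contiguous range, as Eric suggests.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: write every partition's records back-to-back into a
// single output stream, and build an index of (startOffset, length) per
// reduce partition so each reduce can fetch one contiguous byte range.
public class MapOutputIndexSketch {

    // Records are framed as [4-byte length][bytes].
    // Returns index[p] = {startOffset, length} for reduce partition p.
    public static long[][] writePartitioned(DataOutputStream out,
                                            List<List<byte[]>> partitions)
            throws IOException {
        long[][] index = new long[partitions.size()][2];
        long offset = 0;
        for (int p = 0; p < partitions.size(); p++) {
            index[p][0] = offset;                 // range start for reduce p
            for (byte[] rec : partitions.get(p)) {
                out.writeInt(rec.length);         // 4-byte length prefix
                out.write(rec);
                offset += 4 + rec.length;
            }
            index[p][1] = offset - index[p][0];   // range length for reduce p
        }
        return index;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        List<List<byte[]>> parts = Arrays.asList(
                Arrays.asList("ab".getBytes()),                   // reduce 0
                Arrays.asList("c".getBytes(), "de".getBytes()));  // reduce 1
        long[][] index = writePartitioned(new DataOutputStream(buf), parts);
        // reduce 0 owns bytes [0, 6); reduce 1 owns bytes [6, 17)
        System.out.println(Arrays.deepToString(index));
    }
}
```

With multiple spills, the same index structure would be built per spill file, and the final merge (or a multi-range read, for small reduce counts) would consolidate the ranges per reduce target.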
> map outputs should be written to a single output file with an index
> -------------------------------------------------------------------
>
>          Key: HADOOP-331
>          URL: http://issues.apache.org/jira/browse/HADOOP-331
>      Project: Hadoop
>         Type: Improvement
>   Components: mapred
>     Versions: 0.3.2
>     Reporter: eric baldeschwieler
>     Assignee: Yoram Arnon
>      Fix For: 0.5.0
>
> The current strategy of writing a file per target map is consuming a lot of
> unused buffer space (causing out of memory crashes) and puts a lot of burden
> on the FS (many opens, inodes used, etc).
> I propose that we write a single file containing all output and also write an
> index file IDing which byte range in the file goes to each reduce. This will
> remove the issue of buffer waste, address scaling issues with number of open
> files and generally set us up better for scaling. It will also have
> advantages with very small inputs, since the buffer cache will reduce the
> number of seeks needed and the data serving node can open a single file and
> just keep it open rather than needing to do directory and open ops on every
> request.
> The only issue I see is that in cases where the task output is substantially
> larger than its input, we may need to spill multiple times. In this case, we
> can do a merge after all spills are complete (or during the final spill).

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira