[ http://issues.apache.org/jira/browse/HADOOP-200?page=comments#action_12378484 ]
Doug Cutting commented on HADOOP-200:
-------------------------------------

+1 this sounds good to me.

> The map task names are sent to the reduces
> ------------------------------------------
>
>          Key: HADOOP-200
>          URL: http://issues.apache.org/jira/browse/HADOOP-200
>      Project: Hadoop
>         Type: Bug
>   Components: mapred
>     Versions: 0.2
>     Reporter: Owen O'Malley
>     Assignee: Owen O'Malley
>      Fix For: 0.3
>
> As each reduce is created, it is given the entire set of potential map names.
> For my large sort jobs with 64k maps, this means that each reduce task is
> given a two-dimensional array that is 5 tasks/map * 64k maps = 320k strings.
> Since the reduce task is passed from the job tracker to the task tracker and
> down to the task runner, passing the entire list is very expensive. I suspect
> that this is the cause of the slowdowns that I see in the task trackers'
> heartbeats when the reduce tasks are being launched.
>
> I propose that the ReduceTask be changed to just get the count of maps, with
> ids from 0 .. maps - 1.
>
>   public ReduceTask(String jobFile, String taskId, int maps, int partition);
>
> Then we need to change the protocol for finding map outputs:
>
>   MapOutputLocation[] locateMapOutputs(String jobId, int[] mapIds, int partition);
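
For illustration only, here is a rough sketch of how the proposed change might look. The ReduceTask constructor and the locateMapOutputs signature are taken from the issue text; everything else is an assumption made up for this sketch and is not Hadoop's actual code: the class name ReduceInputSketch, the fields of MapOutputLocation, the MapOutputLocator interface, the InMemoryLocator stand-in for the job tracker, and the choice to return only the maps that have already finished.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the HADOOP-200 proposal; not Hadoop's real classes. */
final class ReduceInputSketch {

    /** Stand-in for MapOutputLocation; the fields here are assumed. */
    static final class MapOutputLocation {
        final int mapId;
        final String host;

        MapOutputLocation(int mapId, String host) {
            this.mapId = mapId;
            this.host = host;
        }
    }

    /** Reduce task that carries only the number of maps, not every map task name. */
    static final class ReduceTask {
        final String jobFile;
        final String taskId;
        final int maps;
        final int partition;

        ReduceTask(String jobFile, String taskId, int maps, int partition) {
            this.jobFile = jobFile;
            this.taskId = taskId;
            this.maps = maps;
            this.partition = partition;
        }

        /** Map ids are implicit: 0 .. maps - 1. */
        int[] mapIds() {
            int[] ids = new int[maps];
            for (int i = 0; i < maps; i++) {
                ids[i] = i;
            }
            return ids;
        }
    }

    /** Lookup protocol with the signature proposed in the issue. */
    interface MapOutputLocator {
        MapOutputLocation[] locateMapOutputs(String jobId, int[] mapIds, int partition);
    }

    /** Toy job-tracker stand-in: hostByMapId[i] is null until map i has finished. */
    static final class InMemoryLocator implements MapOutputLocator {
        private final String[] hostByMapId;

        InMemoryLocator(String[] hostByMapId) {
            this.hostByMapId = hostByMapId;
        }

        @Override
        public MapOutputLocation[] locateMapOutputs(String jobId, int[] mapIds, int partition) {
            // jobId and partition are ignored in this toy version.
            List<MapOutputLocation> found = new ArrayList<>();
            for (int id : mapIds) {
                if (id >= 0 && id < hostByMapId.length && hostByMapId[id] != null) {
                    found.add(new MapOutputLocation(id, hostByMapId[id]));
                }
            }
            return found.toArray(new MapOutputLocation[0]);
        }
    }

    /** Number of task-name strings each reduce received in the old scheme. */
    static long mapNamesPerReduce(int tasksPerMap, int maps) {
        return (long) tasksPerMap * maps;
    }

    public static void main(String[] args) {
        ReduceTask task = new ReduceTask("job.xml", "reduce_0", 4, 0);
        MapOutputLocator locator =
            new InMemoryLocator(new String[] {"hostA", null, "hostB", null});
        MapOutputLocation[] ready =
            locator.locateMapOutputs("job_0001", task.mapIds(), task.partition);
        System.out.println(ready.length + " of " + task.maps + " map outputs located");
        System.out.println(mapNamesPerReduce(5, 64 * 1024) + " names per reduce before");
    }
}
```

With 64k taken as 64 * 1024 maps and 5 tasks per map, the old scheme ships 327,680 strings with every reduce task, which matches the roughly 320k quoted in the issue. In the sketch the reduce task carries a single int instead, and the ids are rebuilt on demand.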
