When there are few reducers, sorting should be done by mappers
--------------------------------------------------------------
Key: HADOOP-717
URL: http://issues.apache.org/jira/browse/HADOOP-717
Project: Hadoop
Issue Type: Improvement
Components: mapred
Reporter: arkady borkovsky
If I understand correctly, sorting currently happens on the reducer side.
So if a few hundred mappers produce a few (or many) gigabytes of data, and there
is just ONE reducer to consume it, copying and sorting take forever.
It may make sense to have a special case optimization for a single reducer.
(E.g. "when there is only one reducer and many mappers, the sort is done by
the mappers, and the reducer does only a merge")
Or to have some smarter policy that makes sure sorting uses as many CPUs
as makes sense. If the map step has produced data on all the nodes of the
cluster, it makes sense to use all the nodes for sorting.
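
To illustrate the single-reducer case proposed above, here is a minimal sketch in plain Python, not Hadoop code. The function names map_side_sort and reduce_side_merge and the sample records are hypothetical, chosen only for this example. Each mapper sorts its own output, so the sorting work is spread over all mappers, and the lone reducer performs only a streaming k-way merge of the already-sorted runs.

```python
import heapq


def map_side_sort(records):
    """Hypothetical mapper step: sort this mapper's (key, value) output by key."""
    return sorted(records, key=lambda kv: kv[0])


def reduce_side_merge(sorted_runs):
    """Hypothetical reducer step: merge pre-sorted runs without re-sorting them."""
    # heapq.merge assumes every input is already sorted by the same key.
    return list(heapq.merge(*sorted_runs, key=lambda kv: kv[0]))


# Made-up output of three mappers, unsorted as emitted.
mapper_outputs = [
    [("b", 2), ("a", 1)],
    [("d", 4), ("c", 3)],
    [("a", 5), ("e", 6)],
]

# The sort runs once per mapper, so it can use one CPU per mapper.
runs = [map_side_sort(out) for out in mapper_outputs]

# The single reducer only merges.
merged = reduce_side_merge(runs)
print(merged)
```

With k sorted runs holding n records in total, the merge costs roughly O(n log k) comparisons on the reducer, versus O(n log n) if the reducer had to sort everything itself.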