Doug's calculation shows that the total gain can be at most 1/3: 15 of the
27 seconds are unavoidable, and taking advantage of largely pre-sorted
input reduces the remaining overhead from 12s to 3s, so the best case
drops from 27s to 18s per 100MB.
Does this model assume that the size of the output of reduce is similar
to the size of the input?
An important class of applications (mentioned in this thread before)
uses two inputs:
-- M ("master file") -- very large, presorted and not changing from run
to run,
-- D ("details file") -- smaller, different from run to run, not
necessarily presorted
and the output size is proportional to the size of D.
In this case the gain from "no-sort" may be much higher, as the 13s of
"transfer and write" to DFS (f-g) are applied to the smaller amount of
data, while the 12s of sort-and-shuffle-related work (b-d) are saved on
the larger data.
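A quick back-of-envelope version of both cases, using the per-stage costs from Doug's model quoted below. The 10GB master / 100MB details sizes are made up for illustration, and treating b-d as fully saved in the no-sort case is my reading of the argument, not something stated in the thread:

    # Stage costs in seconds per 100MB, taken from the (a)-(g) list below.
    cost = {'a': 1, 'b': 1, 'c': 10, 'd': 1, 'e': 1, 'f': 2, 'g': 11}

    def seconds(stages, data_mb):
        """Time for the given stages applied to data_mb megabytes."""
        return sum(cost[s] for s in stages) * data_mb / 100

    # Symmetric case: output roughly as large as the input.
    size = 100
    baseline  = seconds('abcdefg', size)                           # 27 s
    presorted = seconds('abdefg', size) + seconds('c', size) / 10  # (c) at in-rack speed
    print(baseline, presorted, 1 - presorted / baseline)           # 27.0 18.0 0.333...

    # Master/detail case: map-side stages see M + D, output stages see ~|D|.
    M, D = 10_000, 100  # MB; assumed sizes for illustration only
    baseline_md = seconds('abcd', M + D) + seconds('efg', D)
    nosort_md   = seconds('a', M + D) + seconds('efg', D)  # b-d saved on the large input
    print(1 - nosort_md / baseline_md)  # ~0.91, far more than 1/3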
On Jan 25, 2007, at 5:21 PM, Doug Cutting (JIRA) wrote:
[ https://issues.apache.org/jira/browse/HADOOP-939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12467717 ]
Doug Cutting commented on HADOOP-939:
-------------------------------------
I suspect that most of the performance gains to be had by declaring
input to be sorted can also be had by using heuristics that speed
things up when input is only nearly sorted. (By "nearly sorted" I mean
things like merging a set of updates into a sorted database, e.g., the
crawl db update task in Nutch.)
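As an illustration of the kind of heuristic being described here (a sketch, not Hadoop or Nutch code; the helper names are invented): split the input into its natural sorted runs and merge them, so nearly sorted input costs little more than one merge pass, while fully unsorted input still comes out correct.

    import heapq

    def sorted_runs(records, key=lambda r: r):
        """Split a stream into maximal non-decreasing runs in one cheap pass."""
        run, prev = [], None
        for rec in records:
            k = key(rec)
            if run and k < prev:
                yield run
                run = []
            run.append(rec)
            prev = k
        if run:
            yield run

    def nearly_sorted_sort(records, key=lambda r: r):
        """Merge the natural runs: nearly sorted input produces few runs,
        so this behaves like a single merge rather than a full sort."""
        return heapq.merge(*sorted_runs(records, key), key=key)

    # e.g. a sorted database with a small batch of out-of-order updates appended
    print(list(nearly_sorted_sort([1, 2, 5, 9, 3, 4, 10])))  # [1, 2, 3, 4, 5, 9, 10]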
Eric Baldeschwieler proposed a simple model for MapReduce performance.
If you assume that disks can read and write at 100MB/s, and that
nodes can talk within rack at 100MB/s (Gb/s) and to nodes in another
rack at 10MB/s, then a MapReduce requires the following number of
seconds per 100MB. (Note that this assumes various sort optimizations
that are already in progress, where map outputs are buffered and
sorted before they're spilled to the local disk on map nodes, and
reduce inputs are buffered and merged before they're spilled to the
local disk on the reduce node, so that, in many cases, reduce can
proceed without an explicit sort stage but simply by merging a set of
already sorted input files from the local disk.)
a. 1 read input data from local drive on map node
[ map ]
b. 1 write batches of sorted output data to temporary file on map node
c. 10 shuffle batches of sorted data to reduce node
d. 1 write batches of sorted data to reduce node
[ reduce]
e. 1 write one copy of output locally
f. 2 transfer and write one copy to another node on the same rack
g. 11 transfer and write one copy to an off-rack node
So the total is 27s/100MB. Only two of those are really
sort-specific, (b) and (d). 15s (more than half) are unavoidable.
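For anyone reconstructing the per-stage numbers from the stated bandwidths, the arithmetic works out as below. Reading (c) as an off-rack transfer, and splitting (f) and (g) into transfer plus write, is an interpretation of the model rather than something spelled out in the comment:

    CHUNK_MB = 100
    DISK, IN_RACK, OFF_RACK = 100, 100, 10  # MB/s, per the stated assumptions

    def t(rate_mb_s):
        """Seconds to move 100MB at the given rate."""
        return CHUNK_MB / rate_mb_s

    stages = {
        'a read input on map node':            t(DISK),                # 1
        'b spill sorted map output':           t(DISK),                # 1
        'c shuffle to reduce node (off-rack)': t(OFF_RACK),            # 10
        'd spill merged data on reduce node':  t(DISK),                # 1
        'e write one output copy locally':     t(DISK),                # 1
        'f replicate to a node in-rack':       t(IN_RACK) + t(DISK),   # 2
        'g replicate to an off-rack node':     t(OFF_RACK) + t(DISK),  # 11
    }
    print(sum(stages.values()))  # 27.0 seconds per 100MB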
The biggest chunk of fat to go after for pre-sorted input is (c).
This can be eliminated if maps can be placed near reduces. For
example, tasktrackers might report the size of each partition they're
generating and the jobtracker might use this to schedule reduces on
racks which already have a lot of their input.
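A sketch of what that scheduling idea could look like, assuming the per-rack partition sizes have already been collected from the tasktrackers' reports. The name place_reduces is invented for this sketch; it shows only the shape of the idea, not jobtracker code, and it ignores load balancing across racks:

    from collections import defaultdict

    def place_reduces(reported_sizes):
        """Greedily place each reduce on the rack that already holds the
        largest share of that partition's map output.

        reported_sizes: {(rack, partition): bytes of map output on that rack}
        returns:        {partition: rack}
        """
        per_partition = defaultdict(lambda: defaultdict(int))
        for (rack, partition), size in reported_sizes.items():
            per_partition[partition][rack] += size
        return {p: max(racks, key=racks.get) for p, racks in per_partition.items()}

    # Toy report: partition 0 mostly lives on rack r1, partition 1 on rack r2.
    reported = {('r1', 0): 800, ('r2', 0): 200, ('r1', 1): 100, ('r2', 1): 900}
    print(place_reduces(reported))  # {0: 'r1', 1: 'r2'}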
No-sort optimization
--------------------
Key: HADOOP-939
URL: https://issues.apache.org/jira/browse/HADOOP-939
Project: Hadoop
Issue Type: New Feature
Components: mapred
Environment: all
Reporter: Doug Judd
There should be a way to tell the mapred framework that the output of
the map() phase will already be sorted. The Reduce phase can just
merge the intermediate files together without sorting.
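To make the requested behaviour concrete, here is a minimal sketch of a reduce side that only merges. It assumes each map-output segment is already sorted by key, which is exactly what the proposed declaration would promise; this is illustrative Python with invented names, not the mapred framework's API:

    import heapq
    from itertools import groupby
    from operator import itemgetter

    def no_sort_reduce(sorted_segments, reduce_fn):
        """Drive reduce() by merging map-output segments that are already
        sorted by key, skipping the sort pass entirely.

        sorted_segments: iterables of (key, value) pairs, each key-sorted
        reduce_fn:       function(key, values_iterator) -> reduced value
        """
        merged = heapq.merge(*sorted_segments, key=itemgetter(0))
        for key, group in groupby(merged, key=itemgetter(0)):
            yield key, reduce_fn(key, (v for _, v in group))

    # Two pre-sorted map outputs; the reducer just sums values per key.
    seg1 = [('a', 1), ('c', 2)]
    seg2 = [('a', 3), ('b', 4)]
    print(list(no_sort_reduce([seg1, seg2], lambda k, vals: sum(vals))))
    # [('a', 4), ('b', 4), ('c', 2)]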