Arkady Borkovsky wrote:
Does this model assume that the size of the output of reduce is similar
to the size of the input?
An important class of applications (mentioned earlier in this thread)
uses two inputs:
-- M ("master file"): very large, presorted, and unchanged from run to run;
-- D ("details file"): smaller, different from run to run, and not
necessarily presorted;
and the output size is proportional to the size of D.
In this case the gain from "no-sort" may be much larger: the step-13
"transfer and write to DFS" applies to a smaller amount of data, while
the sort-and-shuffle-related steps 11(b)-(d) are saved on the larger data.
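
A minimal sketch of the master/details pattern described above, assuming
Hadoop's org.apache.hadoop.mapreduce API; the class name
MasterDetailJoinReducer and the "M"/"D" value tags are illustrative
assumptions, not from the thread. The point it shows is that the reducer
emits one record per detail record, so the output size tracks |D| rather
than |M|.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MasterDetailJoinReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    // A (hypothetical) mapper is assumed to tag each value: "M\t..." for
    // the master record, "D\t..." for detail records sharing the join key.
    String masterRecord = null;
    List<String> details = new ArrayList<String>();
    for (Text value : values) {
      String v = value.toString();
      if (v.startsWith("M\t")) {
        masterRecord = v.substring(2);   // at most one master record per key
      } else if (v.startsWith("D\t")) {
        details.add(v.substring(2));
      }
    }
    // One output record per detail record: output size is proportional
    // to the size of D, not to the size of M.
    for (String d : details) {
      context.write(key, new Text(d + "\t" + masterRecord));
    }
  }
}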
Could a combiner be used in this hypothetical case? If so, the b-d
steps might be faster too (see the sketch after this message).
Doug
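
A minimal, self-contained sketch of the kind of combiner Doug asks about,
again assuming the Hadoop mapreduce API; the SumCombiner name and the
summing example are illustrative assumptions, since the thread does not
say what the detail records contain. It shows how a combiner shrinks the
data that the b-d sort-and-shuffle steps handle when per-key values can be
partially aggregated on the map side.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable result = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();               // partial sum computed on the map side
    }
    result.set(sum);
    context.write(key, result);     // one value per key reaches the shuffle
  }
}

It would be wired in with job.setCombinerClass(SumCombiner.class) alongside
the normal reducer. This is only correct because summation is associative
and commutative; whether a combiner helps the master/details case above
depends on whether the detail records for a key can be pre-aggregated at all.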