How the savings can be much higher than 1/3.
(Probably I'm already preaching to the choir.)
1. The situation I have in mind looks like this:
-- you have a data set M with 10^11 to 10^12 records that is produced
once in a while and used hundreds of times (before its new version is
generated). This data set has been produced by a key-preserving
reduce, so all its fragments are sorted, and the keys are split between
the fragments in a known way.
-- Each time you "use" this data set, you have another data set D with
the same key space (and, for simplicity, the same record type) with 10^8
or 10^9 records.
The job is a JOIN -- it produces an output record for each key in the
intersection of M and D. Think of M as a database and D as a query.
The number of output records is less than or equal to the number of
records in D.
If D fits into memory, you do not need a reduce, and everything can be
done in the map by copying D to all the map tasks. A pain, but no real
problem -- a map with no reduce does it.
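To make that concrete, here is a rough sketch of the in-memory case (plain Java, not the Hadoop API; all names are made up): each map task loads its full copy of D into a hash map and then streams through its split of M, emitting a joined record only for keys that are present in D.

import java.util.HashMap;
import java.util.Map;

// Sketch of the "D fits into memory" case: a map-only join.
// Each map task holds a full copy of D in a hash map and probes it
// with every record of its M split; no reduce phase is needed.
class MapSideHashJoin {
    private final Map<String, String> dByKey = new HashMap<>();

    // Build the in-memory copy of D once, before the map phase starts.
    // Each record is a {key, value} pair.
    void loadD(Iterable<String[]> dRecords) {
        for (String[] rec : dRecords) {
            dByKey.put(rec[0], rec[1]);
        }
    }

    // Called for every record of M; returns the joined record,
    // or null when the key is not in D (no output for that key).
    String map(String mKey, String mValue) {
        String dValue = dByKey.get(mKey);
        return (dValue == null) ? null : mKey + "\t" + mValue + "\t" + dValue;
    }
}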
If D does not fit into memory, a natural way to do this processing is to
-- sort and split D into buckets so that there is one bucket, with the
same keys, for each block of M;
-- run a reduce task on each block of M, close to that block, and merge
(join) the block with the corresponding bucket of D while reading the
input.
So steps b-d are needed only for D, and steps e-g only for the output,
which is the same size as D -- a few percent of the total data involved.
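Here is a rough sketch of the merge step in that scheme (again plain Java with invented names, not Hadoop code): the reduce task walks one sorted block of M and the matching sorted bucket of D in lockstep and emits a record for every key that appears in both.

import java.io.IOException;
import java.util.Iterator;

// One reduce task in the "D does not fit into memory" scheme:
// merge-join a sorted block of M with the matching sorted bucket of D.
// Records are {key, value} pairs, both inputs are sorted by key,
// and the output size is bounded by the size of the D bucket.
class BlockBucketMergeJoin {
    static void join(Iterator<String[]> mBlock, Iterator<String[]> dBucket,
                     Appendable out) throws IOException {
        String[] m = mBlock.hasNext() ? mBlock.next() : null;
        String[] d = dBucket.hasNext() ? dBucket.next() : null;
        while (m != null && d != null) {
            int cmp = m[0].compareTo(d[0]);
            if (cmp < 0) {
                // key only in M: skip the M record
                m = mBlock.hasNext() ? mBlock.next() : null;
            } else if (cmp > 0) {
                // key only in D: no matching "database" record, no output
                d = dBucket.hasNext() ? dBucket.next() : null;
            } else {
                // key in both: emit one joined output record
                out.append(m[0]).append('\t').append(m[1])
                   .append('\t').append(d[1]).append('\n');
                // advance D but keep m, so later D records with the
                // same key (if any) would still match it
                d = dBucket.hasNext() ? dBucket.next() : null;
            }
        }
    }
}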
In this kind of application, Eric's model becomes:
   now   no-sort  step
   M+D   M+D      a. 1   read input data from local drive on map node
                     [ identity map ]
   M+D   D        b. 1   write batches of sorted output data to
                         temporary file on map node
   M+D   D        c. 10  shuffle batches of sorted data
   M+D   D        d. 1   write batches of sorted data to reduce node
                         (no-sort column: D with a "local" reduce,
                          M+D with an "anywhere" reduce)
                     [ reduce ]
   D     D        e. 1   write one copy of output locally
   D     D        f. 2   transfer and write one copy to another node
                         on the same rack
   D     D        g. 11  transfer and write one copy to an off-rack node
So it is 27*D + 13*M vs. 17*D + 2*M. With D << M, the gain is about 6x
or 12x (with an "anywhere" or a "local" reduce, respectively).
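As a sanity check on those ratios, here is a back-of-the-envelope calculation, taking the two formulas above at face value and assuming the "local" reduce variant costs about 1*M + 17*D, since M is then only read in step a (that is my reading of the table, not something from Eric's note):

// Back-of-the-envelope check of the 6x / 12x figures, using the example
// sizes from the beginning of this message (M ~ 10^11 records, D ~ 10^8).
public class GainCheck {
    public static void main(String[] args) {
        double m = 1e11;   // records in M ("database")
        double d = 1e8;    // records in D ("query")

        double now            = 27 * d + 13 * m;  // current sort-based model
        double noSortAnywhere = 17 * d + 2 * m;   // "anywhere" reduce: M is read, then written to the reduce node
        double noSortLocal    = 17 * d + 1 * m;   // "local" reduce: M is only read (assumption, see above)

        System.out.printf("gain, anywhere reduce: %.1fx%n", now / noSortAnywhere);  // ~6.5x
        System.out.printf("gain, local reduce:    %.1fx%n", now / noSortLocal);     // ~12.8x
    }
}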
2. A variation of the situation described above is when, in addition to
M ("database") and D ("query"), you have a third input data set U
("incremental updates" or "deltas"). U is similar in size to D; the
record type of U is exactly the same as that of M. The JOIN works
similarly to the one described above, but it takes a record from D and
one or more records from U and M.
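For what it's worth, the per-key step of that variation might look like the sketch below (made-up names, and a made-up "last delta wins" update rule); the merge over the sorted inputs is the same as before, just three-way.

import java.util.List;

// Per-key step of the M + U + D variation: bring the M record up to date
// with the U deltas for that key before joining it with the D record.
class DeltaJoinStep {
    // mValue may be null (key not yet in M); uDeltas may be empty.
    static String joinOne(String key, String mValue,
                          List<String> uDeltas, String dValue) {
        String current = mValue;
        for (String delta : uDeltas) {
            current = delta;     // placeholder rule: the last delta wins
        }
        return (current == null) ? null                      // key unknown: no output
                                 : key + "\t" + current + "\t" + dValue;
    }
}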
On Jan 29, 2007, at 10:17 AM, Doug Cutting wrote:
Arkady Borkovsky wrote:
Does this model assume that the size of the output of reduce is
similar to the size of the input?
An important class of applications (mentioned in this thread before)
uses two inputs:
-- M ("master file") -- very large, presorted and not changing from
run to run,
-- D ("details file") -- smaller, different from run to run, not
necessarily presorted
and the output size is proportional to the size of D.
In this case the gain from "no-sort" may be much higher, as the 13
"transfer and write" to DFS are applied to a smaller amount of data,
while the 11 sort-and-shuffle-related (b-d) are saved on the larger data.
Could a combiner be used in this hypothetical case? If so, then the
b-d steps might be faster too.
Doug