Re: Suggestions of proper usage of "key" parameter ?

Owen O'Malley Sun, 14 Dec 2008 23:40:03 -0800


On Dec 14, 2008, at 4:47 PM, Ricky Ho wrote:

Yes, I am referring to the "key" INPUT INTO the map() function and the "key" EMITTED FROM the reduce() function. Can someone explain why we need a "key" in these cases and what the proper use of it is?


It was a design choice and could have been done as:

R1 -> map -> K,V -> reduce -> R2

instead of

K1,V1 -> map -> K2,V2 -> reduce -> K3,V3

but since the input of the reduce is sorted on K2, the output of the reduce is also typically sorted and therefore keyed. Since jobs are often chained together, it makes sense to make the reduce output match the map input. Of course everything you could do with the first option is possible with the second using either K1 = R1 or V1 = R1. Having the keys is often convenient...
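To make the keyed data flow concrete, here is a minimal single-process sketch in Python. It is a toy model, not Hadoop code: run_job, word_mapper and sum_reducer are names invented for this illustration. The mapper ignores its input key, which corresponds to the "V1 = R1" case mentioned above.

```python
from itertools import groupby
from operator import itemgetter

def run_job(records, mapper, reducer):
    """Toy model of one map/reduce round (hypothetical helper, not a Hadoop API)."""
    # map: each (K1, V1) record may produce any number of (K2, V2) pairs
    intermediate = [kv for k1, v1 in records for kv in mapper(k1, v1)]
    # shuffle/sort: the reduce input is sorted and grouped on K2
    intermediate.sort(key=itemgetter(0))
    output = []
    for k2, group in groupby(intermediate, key=itemgetter(0)):
        # reduce: each (K2, [V2, ...]) group may produce any number of (K3, V3) pairs
        output.extend(reducer(k2, [v for _, v in group]))
    return output

def word_mapper(offset, line):
    # K1 (the offset) is ignored here: the whole record lives in V1
    for word in line.split():
        yield word, 1

def sum_reducer(word, counts):
    yield word, sum(counts)

records = [(0, "b a"), (4, "a c")]
result = run_job(records, word_mapper, sum_reducer)
print(result)  # [('a', 2), ('b', 1), ('c', 1)]
```

Because the reducer reuses K2 as K3, the output comes out in key order without any extra sort.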

Who determines what the "key" should be? (The corresponding "InputFormat" implementation class?)


The InputFormat makes the choice.

In this case, what is the key in the map() call? (The name of the input file?)

TextInputFormat uses the byte offset as the key and the line as the value.
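As a rough illustration of what that looks like, the sketch below produces (byte offset, line) pairs from a small buffer. It is plain Python, not the Hadoop reader: text_records is a made-up name, and the sketch assumes the line terminator is not part of the value and ignores how a large file is divided among map tasks.

```python
def text_records(data: bytes):
    """Yield (byte_offset, line) pairs for a byte buffer (illustrative only)."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        # key: offset of the line's first byte; value: the line without its terminator
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        # advance by the full raw length, terminator included
        offset += len(raw)

records = list(text_records(b"hello world\nfoo\nbar baz\n"))
print(records)  # [(0, 'hello world'), (12, 'foo'), (16, 'bar baz')]
```

The offset is rarely meaningful to the application, which is why many map() functions simply ignore the key when reading text.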

What if the reduce() function emits multiple <key, value> entries or does not emit any entry at all? Is this considered OK?


Yes.

What if the reduce() function emits a <key, value> entry whose key is not the same as the input key parameter to the reduce() function? Is this OK?

Yes, although the reduce output is not re-sorted, so the results won't be sorted unless K3 is a subset of K2.
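A small sketch of the effect, again as a toy Python model rather than Hadoop code (run_job, word_mapper and count_first_reducer are hypothetical names). The reducer emits the count as its key instead of the word, so the output follows the order of the reduce input keys and is not sorted on the emitted key.

```python
from itertools import groupby
from operator import itemgetter

def run_job(records, mapper, reducer):
    """Toy model of one map/reduce round (hypothetical helper, not a Hadoop API)."""
    intermediate = [kv for k1, v1 in records for kv in mapper(k1, v1)]
    intermediate.sort(key=itemgetter(0))  # sorted on K2 only
    output = []
    for k2, group in groupby(intermediate, key=itemgetter(0)):
        # reducer output is appended as-is; nothing re-sorts it on K3
        output.extend(reducer(k2, [v for _, v in group]))
    return output

def word_mapper(offset, line):
    for word in line.split():
        yield word, 1

def count_first_reducer(word, ones):
    # K3 is the count, which is unrelated to the sort order of K2 (the word)
    yield sum(ones), word

output = run_job([(0, "a a a"), (6, "b"), (8, "c c")],
                 word_mapper, count_first_reducer)
keys = [k for k, _ in output]
print(output)                # [(3, 'a'), (1, 'b'), (2, 'c')]
print(keys == sorted(keys))  # False
```

If a sorted result on the new key is needed, one option is a further map/reduce round keyed on it.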

If there are two Map/Reduce cycles chained together, is the "key" input into the 2nd round map() function determined by the "key" emitted from the 1st round reduce() function?


Yes.
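The chaining can be sketched with the same kind of toy Python model (all function names here are invented for illustration). It assumes the first round's output is read back by an input format that preserves the key/value pairs, for example a sequence file; if the output were written as text and read back as plain lines, the InputFormat would again choose the key.

```python
from itertools import groupby
from operator import itemgetter

def run_job(records, mapper, reducer):
    """Toy model of one map/reduce round (hypothetical helper, not a Hadoop API)."""
    intermediate = [kv for k1, v1 in records for kv in mapper(k1, v1)]
    intermediate.sort(key=itemgetter(0))
    output = []
    for k2, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reducer(k2, [v for _, v in group]))
    return output

def count_mapper(offset, line):
    for word in line.split():
        yield word, 1

def count_reducer(word, ones):
    yield word, sum(ones)

def invert_mapper(word, count):
    # K1, V1 of round two are exactly the K3, V3 emitted by round one
    yield count, word

def collect_reducer(count, words):
    yield count, sorted(words)

round_one = run_job([(0, "b a"), (4, "a c")], count_mapper, count_reducer)
round_two = run_job(round_one, invert_mapper, collect_reducer)
print(round_one)  # [('a', 2), ('b', 1), ('c', 1)]
print(round_two)  # [(1, ['b', 'c']), (2, ['a'])]
```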

-- Owen
