Top K is slightly more complicated to implement efficiently: you might
want to look at how other projects like Pig do it, to compare approaches
and pick up ideas.
Just to get an understanding: your mappers generate <key, value> pairs,
and you want to pick the top K based on value on the reducer side?
Or can the same key come in from multiple mappers, so that you need to
aggregate values at the reducer first?
If the former (that is, each key is unique), then a combiner that emits
its top K per mapper, followed by a single reducer which sorts and picks
from the M * C * K tuples, should do the trick (M == number of mappers,
C == avg number of combiner invocations per mapper, K == number of
output tuples required).
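Roughly something like the following (an untested sketch - the class
name, K, and the Text/LongWritable types are my assumptions, not from
your job). The same class can act as the combiner and as the single
final reducer, since both just keep the K largest values seen and flush
them in cleanup():

    import java.io.IOException;
    import java.util.Map;
    import java.util.TreeMap;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class TopKSelector
            extends Reducer<Text, LongWritable, Text, LongWritable> {

        private static final int K = 1000;  // assumed value of K

        // Sorted by value; holds at most K entries. Equal values collide
        // in this simplified sketch - a real implementation would break ties.
        private final TreeMap<Long, String> topK = new TreeMap<Long, String>();

        @Override
        protected void reduce(Text key, Iterable<LongWritable> values,
                Context context) {
            for (LongWritable v : values) {  // unique keys => usually one value
                topK.put(v.get(), key.toString());
                if (topK.size() > K) {
                    topK.remove(topK.firstKey());  // evict the current smallest
                }
            }
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            // As combiner: forwards this mapper's local top K.
            // As the single reducer: emits the global top K, largest first.
            for (Map.Entry<Long, String> e : topK.descendingMap().entrySet()) {
                context.write(new Text(e.getValue()), new LongWritable(e.getKey()));
            }
        }
    }

Wired up with job.setCombinerClass(TopKSelector.class),
job.setReducerClass(TopKSelector.class) and job.setNumReduceTasks(1).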
If the latter, you can try heuristics to approximate the value, but they
always carry an error margin (doing this efficiently is something I ask
in interviews :) ) which you will need to take into account - or you can
just split it into two jobs: aggregate in job 1, top K in job 2, as
sketched below.
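For the two-job version, a driver could look something like this (again
a rough, untested sketch: the word-count-style job 1, the intermediate
path, and all class names are assumptions; TopKSelector is the class
sketched above):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class TwoJobTopK {

        // Job 1 map: emit <token, 1> per occurrence (word-count-style placeholder).
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, LongWritable> {
            private static final LongWritable ONE = new LongWritable(1);
            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : line.toString().split("\\s+")) {
                    ctx.write(new Text(token), ONE);
                }
            }
        }

        // Job 1 reduce (also usable as its combiner): sum counts per key.
        public static class SumReducer
                extends Reducer<Text, LongWritable, Text, LongWritable> {
            @Override
            protected void reduce(Text key, Iterable<LongWritable> values,
                    Context ctx) throws IOException, InterruptedException {
                long sum = 0;
                for (LongWritable v : values) {
                    sum += v.get();
                }
                ctx.write(key, new LongWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path intermediate = new Path("/tmp/topk-intermediate");  // made-up path

            // Job 1: aggregate fully, with as many reducers as you like.
            Job aggregate = new Job(conf, "aggregate");
            aggregate.setJarByClass(TwoJobTopK.class);
            aggregate.setMapperClass(TokenMapper.class);
            aggregate.setCombinerClass(SumReducer.class);
            aggregate.setReducerClass(SumReducer.class);
            aggregate.setOutputKeyClass(Text.class);
            aggregate.setOutputValueClass(LongWritable.class);
            aggregate.setOutputFormatClass(SequenceFileOutputFormat.class);
            FileInputFormat.addInputPath(aggregate, new Path(args[0]));
            FileOutputFormat.setOutputPath(aggregate, intermediate);
            if (!aggregate.waitForCompletion(true)) {
                System.exit(1);
            }

            // Job 2: identity map over the aggregated counts; one reducer
            // (the TopKSelector sketched earlier) picks the global top K.
            Job topK = new Job(conf, "top-k");
            topK.setJarByClass(TwoJobTopK.class);
            topK.setInputFormatClass(SequenceFileInputFormat.class);
            topK.setReducerClass(TopKSelector.class);
            topK.setNumReduceTasks(1);
            topK.setOutputKeyClass(Text.class);
            topK.setOutputValueClass(LongWritable.class);
            FileInputFormat.addInputPath(topK, intermediate);
            FileOutputFormat.setOutputPath(topK, new Path(args[1]));
            System.exit(topK.waitForCompletion(true) ? 0 : 1);
        }
    }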
Regards,
Mridul
Something Something wrote:
N could be up to 1000, and the output from the Map job could be about 5
million rows. We only want the top 1000 because the rest could be just
noise. Thanks for your help.
On Fri, Jan 29, 2010 at 11:43 AM, Alex Baranov <[email protected]> wrote:
How big is N? How big is the output of the Map job?
Alex.
On Fri, Jan 29, 2010 at 7:36 PM, Something Something <[email protected]> wrote:
I am sorry, but I forgot to add one important piece of information.
I don't want to write any random N rows to the table. I want to write
the *top* N rows - meaning, I want to write the "key" values of the
Reducer in descending order. Does this make sense? Sorry for the
confusion.
On Wed, Jan 27, 2010 at 11:09 PM, Mridul Muralidharan <
[email protected]
wrote:
A possible solution is to emit only N rows from each mapper and then use
1 reduce task [*] - if the value of N is not very high.
So you end up with at most m * N rows at the reducer instead of the full
input set - and so the limit is easier to apply (see the sketch after
the footnote below).
If you are OK with some variance in the number of rows inserted (and if
the value of N is very high), you can do more interesting things like
N/m rows per mapper - and multiple reducers (r), with the assumption
that each reducer will see at least N/r rows - and so you can limit to
N/r per reducer. Of course, there is a possible error that gets
introduced here ...
Regards,
Mridul
[*] Assuming you just want a simple limit - nothing else.
Also note, each mapper might want to emit N rows instead of 'tweaks'
like N/m rows, since it is possible that multiple mappers have fewer
than N/m rows to emit to begin with!
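As a rough illustration of the per-mapper limit (untested; the
tab-separated "<row>\t<value>" input format, N, and the class name are
my assumptions): each mapper keeps its local top N in memory and only
emits them from cleanup(), so the single reducer sees at most m * N rows.

    import java.io.IOException;
    import java.util.Map;
    import java.util.TreeMap;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class TopNMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

        private static final int N = 1000;  // assumed limit

        // Sorted by value; ties overwrite each other in this simplified sketch.
        private final TreeMap<Long, String> topN = new TreeMap<Long, String>();

        @Override
        protected void map(LongWritable offset, Text line, Context context) {
            // Assumed input format: "<row>\t<value>".
            String[] parts = line.toString().split("\t");
            topN.put(Long.parseLong(parts[1]), parts[0]);
            if (topN.size() > N) {
                topN.remove(topN.firstKey());  // drop the current smallest
            }
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            // Emit only this mapper's top N; with a single reducer, it then
            // sees at most m * N rows in total.
            for (Map.Entry<Long, String> e : topN.entrySet()) {
                context.write(new LongWritable(e.getKey()), new Text(e.getValue()));
            }
        }
    }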
Something Something wrote:
If I set the # of reduce tasks to 1 using setNumReduceTasks(1), would
the class be instantiated on only one machine... always? I mean, if I
have a cluster of, say, 1 master, 10 workers & 3 zookeepers, is the
Reducer class guaranteed to be instantiated on only 1 machine?
If the answer is yes, then I will use a static variable as a counter to
see how many rows have been added to my HBase table so far. In my use
case, I want to write only N rows to a table. Is there a better way to
do this? Please let me know. Thanks.
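For reference, a rough sketch of the single-reducer counter idea
(untested; the column family "cf", qualifier "value", and N are made
up). With setNumReduceTasks(1) there is exactly one reduce task, so a
plain instance field works as the counter - a static buys nothing
anyway, since each task attempt runs in its own JVM:

    import java.io.IOException;

    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableReducer;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;

    public class LimitNTableReducer
            extends TableReducer<Text, LongWritable, ImmutableBytesWritable> {

        private static final long N = 1000;  // assumed row limit
        private long written = 0;  // instance field: one reduce task => one counter

        @Override
        protected void reduce(Text key, Iterable<LongWritable> values,
                Context context) throws IOException, InterruptedException {
            if (written >= N) {
                return;  // already wrote N rows; ignore the rest
            }
            // Write one HBase row per key; assumes column family "cf" exists.
            byte[] row = Bytes.toBytes(key.toString());
            Put put = new Put(row);
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("value"),
                    Bytes.toBytes(values.iterator().next().get()));
            context.write(new ImmutableBytesWritable(row), put);
            written++;
        }
    }

Note that reduce keys arrive in ascending sort order by default, so
writing the *top* N this way would also need a descending sort
comparator set on the job (job.setSortComparatorClass).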