Top K is slightly more complicated to implement efficiently: you might
want to look at how other projects like Pig do it, to compare approaches
and pick up ideas.
Just to get an understanding: your mappers generate <key, value> pairs,
and you want to pick the top K based on value on the reducer side?
Or can the same key come in from multiple mappers, so that you need to
aggregate values at the reducer first?
If the former (that is, each key is unique), then a combiner that emits
its top K per mapper, followed by a single reducer which sorts and picks
from the M * C * K tuples, should do the trick (M == number of mappers,
C == avg number of combiner invocations per mapper, K == number of
output tuples required).
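Roughly something like the following (an untested sketch - the class
name, K, and the Text/LongWritable types are my assumptions, not from
your job). The same class can act as the combiner and as the single
final reducer, since both just keep the K largest values seen and flush
them in cleanup():

    import java.io.IOException;
    import java.util.Map;
    import java.util.TreeMap;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class TopKSelector
            extends Reducer<Text, LongWritable, Text, LongWritable> {

        private static final int K = 1000;  // assumed value of K

        // Sorted by value; holds at most K entries. Equal values collide
        // in this simplified sketch - a real implementation would break ties.
        private final TreeMap<Long, String> topK = new TreeMap<Long, String>();

        @Override
        protected void reduce(Text key, Iterable<LongWritable> values,
                Context context) {
            for (LongWritable v : values) {  // unique keys => usually one value
                topK.put(v.get(), key.toString());
                if (topK.size() > K) {
                    topK.remove(topK.firstKey());  // evict the current smallest
                }
            }
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            // As combiner: forwards this mapper's local top K.
            // As the single reducer: emits the global top K, largest first.
            for (Map.Entry<Long, String> e : topK.descendingMap().entrySet()) {
                context.write(new Text(e.getValue()), new LongWritable(e.getKey()));
            }
        }
    }

Wired up with job.setCombinerClass(TopKSelector.class),
job.setReducerClass(TopKSelector.class) and job.setNumReduceTasks(1).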
If the latter, you can try heuristics to approximate the value, but they
always carry an error margin (doing this efficiently is something I ask
in interviews :) ) which you will need to take into account - or you can
just split it into two jobs: aggregate in job 1, top K in job 2, as
sketched below.
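For the two-job version, a driver could look something like this (again
a rough, untested sketch: the word-count-style job 1, the intermediate
path, and all class names are assumptions; TopKSelector is the class
sketched above):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class TwoJobTopK {

        // Job 1 map: emit <token, 1> per occurrence (word-count-style placeholder).
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, LongWritable> {
            private static final LongWritable ONE = new LongWritable(1);
            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : line.toString().split("\\s+")) {
                    ctx.write(new Text(token), ONE);
                }
            }
        }

        // Job 1 reduce (also usable as its combiner): sum counts per key.
        public static class SumReducer
                extends Reducer<Text, LongWritable, Text, LongWritable> {
            @Override
            protected void reduce(Text key, Iterable<LongWritable> values,
                    Context ctx) throws IOException, InterruptedException {
                long sum = 0;
                for (LongWritable v : values) {
                    sum += v.get();
                }
                ctx.write(key, new LongWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path intermediate = new Path("/tmp/topk-intermediate");  // made-up path

            // Job 1: aggregate fully, with as many reducers as you like.
            Job aggregate = new Job(conf, "aggregate");
            aggregate.setJarByClass(TwoJobTopK.class);
            aggregate.setMapperClass(TokenMapper.class);
            aggregate.setCombinerClass(SumReducer.class);
            aggregate.setReducerClass(SumReducer.class);
            aggregate.setOutputKeyClass(Text.class);
            aggregate.setOutputValueClass(LongWritable.class);
            aggregate.setOutputFormatClass(SequenceFileOutputFormat.class);
            FileInputFormat.addInputPath(aggregate, new Path(args[0]));
            FileOutputFormat.setOutputPath(aggregate, intermediate);
            if (!aggregate.waitForCompletion(true)) {
                System.exit(1);
            }

            // Job 2: identity map over the aggregated counts; one reducer
            // (the TopKSelector sketched earlier) picks the global top K.
            Job topK = new Job(conf, "top-k");
            topK.setJarByClass(TwoJobTopK.class);
            topK.setInputFormatClass(SequenceFileInputFormat.class);
            topK.setReducerClass(TopKSelector.class);
            topK.setNumReduceTasks(1);
            topK.setOutputKeyClass(Text.class);
            topK.setOutputValueClass(LongWritable.class);
            FileInputFormat.addInputPath(topK, intermediate);
            FileOutputFormat.setOutputPath(topK, new Path(args[1]));
            System.exit(topK.waitForCompletion(true) ? 0 : 1);
        }
    }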
Regards,
Mridul
Something Something wrote:
N could be up to 1000, and the output from the Map job could be about 5
million rows. We only want the top 1000 because the rest could be just
noise. Thanks for your help.
On Fri, Jan 29, 2010 at 11:43 AM, Alex Baranov <[email protected]> wrote:
How big is N? How big is the output of the Map job?
Alex.
On Fri, Jan 29, 2010 at 7:36 PM, Something Something <[email protected]> wrote:
I am sorry, but I forgot to add one important piece of information.
I don't want to write any random N rows to the table. I want to write
the *top* N rows - meaning, I want to write the "key" values of the
Reducer in descending order. Does this make sense? Sorry for the
confusion.
On Wed, Jan 27, 2010 at 11:09 PM, Mridul Muralidharan <
[email protected]
wrote:
A possible solution is to emit only N rows from each mapper and then use
1 reduce task [*] - if the value of N is not very high.
So you end up with at most m * N rows at the reducer instead of the full
input set - and so the limit is easier to apply (see the sketch after
the footnote below).
If you are OK with some variance in the number of rows inserted (and if
the value of N is very high), you can do more interesting things like
N/m rows per mapper - and multiple reducers (r), with the assumption
that each reducer will see at least N/r rows - and so you can limit to
N/r per reducer. Of course, there is a possible error that gets
introduced here ...
Regards,
Mridul
[*] Assuming you just want a simple limit - nothing else.
Also note, each mapper might want to emit N rows instead of 'tweaks'
like N/m rows, since it is possible that multiple mappers have fewer
than N/m rows to emit to begin with!
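As a rough illustration of the per-mapper limit (untested; the
tab-separated "<row>\t<value>" input format, N, and the class name are
my assumptions): each mapper keeps its local top N in memory and only
emits them from cleanup(), so the single reducer sees at most m * N rows.

    import java.io.IOException;
    import java.util.Map;
    import java.util.TreeMap;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class TopNMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

        private static final int N = 1000;  // assumed limit

        // Sorted by value; ties overwrite each other in this simplified sketch.
        private final TreeMap<Long, String> topN = new TreeMap<Long, String>();

        @Override
        protected void map(LongWritable offset, Text line, Context context) {
            // Assumed input format: "<row>\t<value>".
            String[] parts = line.toString().split("\t");
            topN.put(Long.parseLong(parts[1]), parts[0]);
            if (topN.size() > N) {
                topN.remove(topN.firstKey());  // drop the current smallest
            }
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            // Emit only this mapper's top N; with a single reducer, it then
            // sees at most m * N rows in total.
            for (Map.Entry<Long, String> e : topN.entrySet()) {
                context.write(new LongWritable(e.getKey()), new Text(e.getValue()));
            }
        }
    }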
Something Something wrote:
If I set the # of reduce tasks to 1 using setNumReduceTasks(1), would
the class be instantiated on only one machine... always? I mean, if I
have a cluster of, say, 1 master, 10 workers & 3 zookeepers, is the
Reducer class guaranteed to be instantiated on only 1 machine?
If the answer is yes, then I will use a static variable as a counter to
see how many rows have been added to my HBase table so far. In my use
case, I want to write only N rows to a table. Is there a better way to
do this? Please let me know. Thanks.
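For reference, a rough sketch of the single-reducer counter idea
(untested; the column family "cf", qualifier "value", and N are made
up). With setNumReduceTasks(1) there is exactly one reduce task, so a
plain instance field works as the counter - a static buys nothing
anyway, since each task attempt runs in its own JVM:

    import java.io.IOException;

    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableReducer;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;

    public class LimitNTableReducer
            extends TableReducer<Text, LongWritable, ImmutableBytesWritable> {

        private static final long N = 1000;  // assumed row limit
        private long written = 0;  // instance field: one reduce task => one counter

        @Override
        protected void reduce(Text key, Iterable<LongWritable> values,
                Context context) throws IOException, InterruptedException {
            if (written >= N) {
                return;  // already wrote N rows; ignore the rest
            }
            // Write one HBase row per key; assumes column family "cf" exists.
            byte[] row = Bytes.toBytes(key.toString());
            Put put = new Put(row);
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("value"),
                    Bytes.toBytes(values.iterator().next().get()));
            context.write(new ImmutableBytesWritable(row), put);
            written++;
        }
    }

Note that reduce keys arrive in ascending sort order by default, so
writing the *top* N this way would also need a descending sort
comparator set on the job (job.setSortComparatorClass).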