It is the latter, not the former. Yes, but approximation is not allowed in this case. Not sure how splitting it into 2 jobs would help, but let me think more about it. Thanks for the help.
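For what it's worth, here is a rough, untested sketch of what the second job (the top-K pass) might look like, assuming job 1 has already aggregated a count per key, assuming the Hadoop 0.20 org.apache.hadoop.mapreduce API, and taking K = 1000. The class name TopKReducer is made up:

// Untested sketch of job 2: pick the global top K from job 1's
// aggregated <key, count> output. Assumes the Hadoop 0.20
// org.apache.hadoop.mapreduce API; run with job.setNumReduceTasks(1)
// so a single reducer sees every key.
import java.io.IOException;
import java.util.TreeMap;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TopKReducer
    extends Reducer<Text, LongWritable, LongWritable, Text> {

  private static final int K = 1000;
  // Sorted by count; holds at most K entries, so memory stays bounded
  // even if the map output is millions of rows.
  private final TreeMap<Long, Text> topK = new TreeMap<Long, Text>();

  @Override
  protected void reduce(Text key, Iterable<LongWritable> values,
      Context context) throws IOException, InterruptedException {
    long sum = 0;
    for (LongWritable v : values) {
      sum += v.get();   // finish the aggregation for this key
    }
    // Caveat: two keys with the same count overwrite each other here;
    // a real version would use a composite key or a multimap.
    topK.put(sum, new Text(key));
    if (topK.size() > K) {
      topK.remove(topK.firstKey());   // evict the current smallest
    }
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    // Emit in descending order of count, as asked for in this thread.
    for (Long count : topK.descendingKeySet()) {
      context.write(new LongWritable(count), topK.get(count));
    }
  }
}

Since the TreeMap never grows past K entries, the single reducer stays cheap even with 5 million map outputs.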
On Sat, Jan 30, 2010 at 1:35 AM, Mridul Muralidharan <[email protected]> wrote:

> Top K is slightly more complicated (in comparison) to implement
> efficiently: you might want to look at other projects like Pig to see how
> they do it (to compare and look at ideas).
>
> Just to get an understanding - your mappers generate <key, value>, and
> you want to pick the top K based on value on the reducer side? Or can you
> have multiple keys coming in from various mappers that you need to
> aggregate at the reducer?
>
> If the former (that is, the key is unique), then a combiner that emits
> the top K per mapper, and then a single reducer which sorts and picks
> from the M * C * K tuples, should do the trick (M == number of mappers,
> C == avg number of combiner invocations per mapper, K == number of output
> tuples required).
>
> If the latter, you can try heuristics to approximate the value, but that
> always has an error margin (doing it efficiently is something I ask about
> in interviews :) ) which you will need to take into account - or you can
> just split it into two jobs: aggregate in job 1, top K in job 2.
>
> Regards,
> Mridul
>
> Something Something wrote:
>
>> N could be up to 1000, and the output from the Map job could be about
>> 5 million. We only want the top 1000 because the rest of it could be
>> just noise. Thanks for your help.
>>
>> On Fri, Jan 29, 2010 at 11:43 AM, Alex Baranov <[email protected]> wrote:
>>
>>> How big is N? How big is the output of the Map job?
>>>
>>> Alex.
>>>
>>> On Fri, Jan 29, 2010 at 7:36 PM, Something Something <[email protected]> wrote:
>>>
>>>> I am sorry, but I forgot to add one important piece of information.
>>>>
>>>> I don't want to write just any random N rows to the table. I want to
>>>> write the *top* N rows - meaning - I want to write the "key" values of
>>>> the Reducer in descending order. Does this make sense? Sorry for the
>>>> confusion.
>>>>
>>>> On Wed, Jan 27, 2010 at 11:09 PM, Mridul Muralidharan <[email protected]> wrote:
>>>>
>>>>> A possible solution is to emit only N rows from each mapper and then
>>>>> use 1 reduce task [*] - if the value of N is not very high. You end
>>>>> up with at most m * N rows at the reducer instead of the full input
>>>>> set - so the limit is easier to apply.
>>>>>
>>>>> If you are OK with some variance in the number of rows inserted (and
>>>>> if the value of N is very high), you can do more interesting things
>>>>> like N/m' rows per mapper - and multiple reducers (r), with the
>>>>> assumption that each reducer will see at least N/r rows - so you can
>>>>> limit to N/r per reducer. Of course, there is a possible error that
>>>>> gets introduced here ...
>>>>>
>>>>> Regards,
>>>>> Mridul
>>>>>
>>>>> [*] Assuming you just want a simple limit - nothing else. Also note,
>>>>> each mapper might want to emit N rows instead of 'tweaks' like N/m
>>>>> rows, since it is possible that multiple mappers have fewer than N/m
>>>>> rows to emit to begin with!
>>>>>
>>>>> Something Something wrote:
>>>>>
>>>>>> If I set the # of reduce tasks to 1 using setNumReduceTasks(1),
>>>>>> would the class be instantiated on only one machine... always? I
>>>>>> mean, if I have a cluster of say 1 master, 10 workers & 3
>>>>>> zookeepers, is the Reducer class guaranteed to be instantiated on
>>>>>> only 1 machine?
>>>>>>
>>>>>> If the answer is yes, then I will use a static variable as a counter
>>>>>> to see how many rows have been added to my HBase table so far. In my
>>>>>> use case, I want to write only N rows to a table. Is there a better
>>>>>> way to do this? Please let me know. Thanks.
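For reference, a rough, untested sketch of the single-reducer limit idea from the quoted thread (class and field names are made up; assumes the Hadoop 0.20 mapreduce API). With setNumReduceTasks(1) there is exactly one reduce task, so an ordinary instance field works as the counter:

// Untested sketch: write at most N rows from a single reducer. With
// job.setNumReduceTasks(1) there is exactly one reduce task, so a
// plain instance field is a safe counter - no static needed (and a
// static would not help anyway, since tasks run in separate JVMs).
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class LimitNReducer
    extends Reducer<Text, LongWritable, Text, LongWritable> {

  private static final int N = 1000;
  private int emitted = 0;   // rows written so far by this (single) task

  @Override
  protected void reduce(Text key, Iterable<LongWritable> values,
      Context context) throws IOException, InterruptedException {
    if (emitted >= N) {
      return;   // already wrote N rows; ignore the rest
    }
    for (LongWritable v : values) {
      context.write(key, v);
      if (++emitted >= N) {
        return;
      }
    }
  }
}

Note that keys arrive at the reducer in ascending order by default, so to make these first N rows the *top* N you would also want a descending sort comparator on the job (e.g. via job.setSortComparatorClass()).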
