One option that should work reasonably well is to have each mapper emit its
records under a single constant key (as Rui suggests) and use a combiner to
pre-select the top N elements.  The combiner runs on the map side against
local map output, so this should be just about as fast as what was
originally suggested.  The single final reducer then only sees on the order
of N records per map, so its cost should be too small to measure.
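
To make the idea concrete, here is a minimal sketch in plain Java with no
Hadoop API involved.  It assumes records are reduced to bare long values, and
the class name TopN, the method topN and the sample values are all made up
for illustration.  The same selection function plays the role of both the
combiner (once per map's output) and the single reducer (once over the merged
candidates).  This works because the top N of the whole data set is always
contained in the union of the per-map top N lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only: not Hadoop code, just the selection logic.
class TopN {

    // Return the n largest values, largest first, using a min-heap that
    // never holds more than n elements.
    static List<Long> topN(Iterable<Long> values, int n) {
        PriorityQueue<Long> heap = new PriorityQueue<>(); // smallest on top
        for (Long v : values) {
            heap.offer(v);
            if (heap.size() > n) {
                heap.poll(); // evict the smallest candidate
            }
        }
        List<Long> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder());
        return result;
    }

    public static void main(String[] args) {
        int n = 3;

        // Made-up outputs of two map tasks, all under one constant key.
        List<Long> map1 = Arrays.asList(5L, 42L, 7L, 19L);
        List<Long> map2 = Arrays.asList(100L, 3L, 8L, 41L);

        // "Combiner" step: each map's output is cut down to its local top n.
        List<Long> candidates = new ArrayList<>();
        candidates.addAll(topN(map1, n));
        candidates.addAll(topN(map2, n));

        // "Reducer" step: the single reducer picks the global top n.
        System.out.println(topN(candidates, n)); // prints [100, 42, 41]
    }
}
```

Because the heap is capped at N entries, the memory used is independent of
how many records a map produces, which matters more than the sort at the end
when N is small.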

On 1/15/08 5:02 PM, "Rui Shi" <[EMAIL PROTECTED]> wrote:

> Hi,
> 
> As far as I understand, having each mapper produce its top N records will
> not work, because each mapper only has partial knowledge of the data, which
> will not give a globally optimal result... I think your mapper needs to
> output all records (combined) and let the reducer pick the top N values.
> 
> 
> 
> -Rui
> 
> 
> 
> ----- Original Message ----
> From: Vadim Zaliva <[EMAIL PROTECTED]>
> To: hadoop-user@lucene.apache.org
> Sent: Tuesday, January 15, 2008 4:13:11 PM
> Subject: Re: single output file
> 
> 
> 
> On Jan 15, 2008, at 13:57, Ted Dunning wrote:
> 
>> This is happening because you have many reducers running, only one of
>> which gets any data.
>> 
>> Since you have combiners, this probably isn't a problem.  That reducer
>> should only get as many records as you have maps.  It would be a problem
>> if your reducer were getting lots of input records.
>> 
>> You can avoid this by setting the number of reducers to 1.
> 
> Thanks!
> 
> I also have another, perhaps stupid, question. I am trying to write a
> task which will produce a list of the records with the top N values. My
> idea is to write a reducer class which iterates through the records,
> keeping the N with the biggest values, and spits them out. I can use it as
> both a combiner and a reducer class. This way each map task will produce N
> records, and I will set up a single reduce task which will combine them
> into the final N records. (N is reasonably small, like 10.) However, to do
> this I need to postpone issuing output until I am done processing all
> records. I can try to do this in the close() method, but I do not have an
> OutputCollector there. I guess I could write a special output collector,
> but that seems a bit artificial.
> 
> Probably I am missing something obvious, and there is a common and easy
> way to do this?
> 
> Thanks!
> 
> Sincerely,
> Vadim
