Hi,

As far as I understand, having each mapper produce its own top N records will
not work, since each mapper only has partial knowledge of the data and cannot
guarantee a globally optimal result... I think your mappers need to output all
records (possibly combined) and let a single reducer pick the top N values.
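
Something along these lines is what I mean -- just a rough sketch against the
old org.apache.hadoop.mapred API; it assumes all map outputs are emitted under
one shared key so a single reduce() call sees every record, and the
Text/LongWritable types are only placeholders:

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.PriorityQueue;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Rough sketch: keeps the N largest values for the (single) key it sees.
    public class TopNReducer extends MapReduceBase
        implements Reducer<Text, LongWritable, Text, LongWritable> {

      private static final int N = 10;

      public void reduce(Text key, Iterator<LongWritable> values,
                         OutputCollector<Text, LongWritable> output,
                         Reporter reporter) throws IOException {
        // Min-heap holding at most the N largest values seen so far.
        PriorityQueue<Long> top = new PriorityQueue<Long>(N);
        while (values.hasNext()) {
          top.add(values.next().get());
          if (top.size() > N) {
            top.poll();  // discard the current smallest
          }
        }
        for (Long v : top) {
          output.collect(key, new LongWritable(v));
        }
      }
    }

With a single reduce task, that one reduce() call emits the global top N.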



-Rui



----- Original Message ----
From: Vadim Zaliva <[EMAIL PROTECTED]>
To: hadoop-user@lucene.apache.org
Sent: Tuesday, January 15, 2008 4:13:11 PM
Subject: Re: single output file



On Jan 15, 2008, at 13:57, Ted Dunning wrote:

> This is happening because you have many reducers running, only one of
> which gets any data.
>
> Since you have combiners, this probably isn't a problem.  That reducer
> should only get as many records as you have maps.  It would be a problem
> if your reducer were getting lots of input records.
>
> You can avoid this by setting the number of reducers to 1.
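
For reference, I assume that with the old JobConf API this is just the
following (MyJob being a placeholder for the actual job class):

    JobConf conf = new JobConf(MyJob.class);  // placeholder job class
    conf.setNumReduceTasks(1);                // force a single reduce task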

Thanks!

I also have another, perhaps stupid, question. I am trying to write a
task which will produce a list of the records with the top N values. My idea
is to write a reducer class which iterates through the records, keeps the N
with the biggest values, and spits them out. I can use it as both the
combiner and the reducer class. This way each map task will produce N
records, and I will set up a single reduce task which will combine them
into the final N records (N is reasonably small, like 10). However, to do
this I need to postpone issuing output until I am done processing all
records. I could try to do it in the close() method, but I do not have an
OutputCollector there. I guess I could write a special output collector,
but that seems a bit artificial. A rough sketch of what I have in mind is
below.
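
This is roughly the shape I am thinking of -- just a sketch against the old
mapred API; caching the OutputCollector passed to reduce() so that close()
can use it is the workaround I am unsure about, and the Text/LongWritable
types and class name are placeholders:

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.Map;
    import java.util.TreeMap;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Sketch of a class usable as both combiner and reducer: it keeps the
    // N records with the largest values across all reduce() calls and only
    // emits them in close(), via an OutputCollector saved during reduce().
    public class TopNSelector extends MapReduceBase
        implements Reducer<Text, LongWritable, Text, LongWritable> {

      private static final int N = 10;
      // Value -> key, sorted ascending; records with equal values collapse
      // in this simplified sketch.
      private final TreeMap<Long, Text> top = new TreeMap<Long, Text>();
      private OutputCollector<Text, LongWritable> collector;

      public void reduce(Text key, Iterator<LongWritable> values,
                         OutputCollector<Text, LongWritable> output,
                         Reporter reporter) throws IOException {
        collector = output;  // saved so close() has somewhere to write
        while (values.hasNext()) {
          top.put(values.next().get(), new Text(key));
          if (top.size() > N) {
            top.remove(top.firstKey());  // drop the smallest value
          }
        }
      }

      public void close() throws IOException {
        if (collector != null) {
          for (Map.Entry<Long, Text> e : top.entrySet()) {
            collector.collect(e.getValue(), new LongWritable(e.getKey()));
          }
        }
      }
    }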

Probably I am missing something obvious and there is a common and easy way
to do this?

Thanks!

Sincerely,
Vadim






      