Yes.  It does.

This can be a source of imbalanced load in the reducers, but it is
essential to the correct functioning of the map-reduce model.
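
To make that concrete, here is a minimal sketch using the 0.17-era
org.apache.hadoop.mapred API (the class name is made up): a reducer that sums
counts.  The framework hands it every value the maps emitted for a key in a
single reduce() call, which is exactly why one hot key turns into one very
long values iterator on one reducer.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Made-up example class; the framework calls reduce() exactly once per key,
// handing it every value the maps emitted for that key as a single Iterator.
public class SumReducer extends MapReduceBase
    implements Reducer<Text, LongWritable, Text, LongWritable> {

  public void reduce(Text key, Iterator<LongWritable> values,
                     OutputCollector<Text, LongWritable> output,
                     Reporter reporter) throws IOException {
    long sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();   // a hot key means one very long list here
    }
    output.collect(key, new LongWritable(sum));
  }
}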

Sometimes it is worth adding some salt to the key just so that the largest
reduce lists get split up a bit for a more equitable distribution of the
reduce workload.  You can only do that if you don't mind reduces running on
partial lists.  Of course, if you can tolerate that, you probably would have
used a combiner already and not needed this hack.
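
A rough sketch of what that salting might look like on the map side (the
class name, the "#" separator, and the bucket count are all made up for
illustration):

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Made-up example class: spread each hot key over SALT_BUCKETS reduce keys.
// A later pass (or whatever reads the output) must strip the "#n" suffix and
// merge the partial sums, since each reduce now sees only part of the list.
public class SaltedCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private static final int SALT_BUCKETS = 8;          // arbitrary choice
  private static final LongWritable ONE = new LongWritable(1);
  private final Random random = new Random();

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, LongWritable> output,
                  Reporter reporter) throws IOException {
    for (String word : line.toString().split("\\s+")) {
      if (word.length() == 0) continue;
      // "the" becomes "the#0" .. "the#7" instead of one giant reduce list
      output.collect(new Text(word + "#" + random.nextInt(SALT_BUCKETS)), ONE);
    }
  }
}

A hash-based salt works just as well as a random one and keeps the output
deterministic; either way the partial sums have to be merged downstream.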

On Fri, Jun 13, 2008 at 1:04 AM, Andreas Kostyrka <[EMAIL PROTECTED]>
wrote:

> Sorry for replying to a private email on the mailing list, but I strongly
> believe in leaving the next guy something to google ;)
>
> Anyway, as you seem to be knowledgeable about sorting, one question:
>
> Does Hadoop provide all key/value tuples for a given key in one batch to
> the reducer, or not?
>
> TIA,
>
> Andreas
>
> On Friday 13 June 2008 02:48:52 you wrote:
> > Great deal; thanks for sending it to me.
> >
> > This has exactly the same pattern described in the JIRA
> > (HADOOP-3442); the partition that fails is nearly sorted and it's
> > selected one of the largest values as its pivot.
> >
> > The fix is checked into the 0.17 branch; if you check it out and
> > deploy it, your jobs should finish without causing the
> > StackOverflowError. If you're noticing inordinately long sort times
> > for your job (i.e. this is a common pattern for your data), then you
> > might consider applying HADOOP-3308 and HADOOP-3442 (the former so
> > the latter applies cleanly). Really sorry you hit this; let me know
> > if the sort times with the 0.17.1 branch are inordinately long, so
> > this can get another iteration if it needs it. -C
>
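
Since Andreas wants to leave something for the next guy to google, here is a
toy illustration of the failure mode Chris describes (this is not Hadoop's
actual sort code): a quicksort that keeps drawing a near-largest pivot on
nearly sorted input recurses once per element, and that linear recursion
depth is what produces the StackOverflowError.

// Toy illustration, NOT Hadoop's sort code: a quicksort that keeps choosing
// a near-largest pivot on already-sorted input recurses once per element,
// so a few hundred thousand records are enough to exhaust the stack.
public class PivotDepthDemo {
  static void sort(int[] a, int lo, int hi) {
    if (lo >= hi) return;
    int pivot = a[hi];                 // worst possible pivot for sorted data
    int i = lo;
    for (int j = lo; j < hi; j++) {
      if (a[j] < pivot) { int t = a[i]; a[i] = a[j]; a[j] = t; i++; }
    }
    int t = a[i]; a[i] = a[hi]; a[hi] = t;
    sort(a, lo, i - 1);                // n-1 elements => recursion depth ~n
    sort(a, i + 1, hi);
  }

  public static void main(String[] args) {
    int[] data = new int[200000];
    for (int i = 0; i < data.length; i++) data[i] = i;   // already sorted
    sort(data, 0, data.length - 1);    // blows the default thread stack
  }
}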



-- 
ted