My first thought is that it depends on the reduce logic. If the reduction can be done in two passes, you could do an initial arbitrary partition of the dominant key's records and bring the partial results together in a second reduction (or a map-side join). I would use a round-robin strategy to assign the arbitrary partitions; a sketch of that first pass follows below.
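
For illustration only, here is a minimal sketch of what that round-robin first pass could look like as a custom Hadoop Partitioner. The class name SkewAwarePartitioner, the hard-coded HOT_KEY, and the assumption that the dominant key is already known (say, from a sampling pass) are all mine, not from the thread; the second-pass job that merges the hot key's partial results is left out.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Hypothetical partitioner: records for a known hot key are spread
    // round-robin across all reducers; every other key hashes as usual.
    // A second job (or map-side join) must merge the hot key's partial
    // results afterwards, so this only works if the reduce logic can be
    // split into two passes (e.g. it is commutative and associative).
    public class SkewAwarePartitioner extends Partitioner<Text, Text> {

        // Assumption: the dominant key is known in advance. In practice
        // it could be passed in via the job Configuration instead.
        private static final String HOT_KEY = "the-dominant-key";

        // Per-map-task counter used to rotate the hot key's records.
        private int nextPartition = 0;

        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            if (HOT_KEY.equals(key.toString())) {
                // Round-robin: each record for the hot key goes to the
                // next reducer in turn, so no single reducer sees 99%.
                int partition = nextPartition;
                nextPartition = (nextPartition + 1) % numPartitions;
                return partition;
            }
            // Normal keys keep default hash partitioning so all of a
            // key's records still land on a single reducer.
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

Wiring it in would just be job.setPartitionerClass(SkewAwarePartitioner.class) on the first job. The trade-off is the extra pass: the first job emits per-reducer partial aggregates for the hot key, and the second job combines them.
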
On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <zjf...@gmail.com> wrote:
> Hi all,
>
> Today a problem about imbalanced data came to mind.
>
> I'd like to know how Hadoop handles this kind of data, e.g. when one key
> dominates the map output, say 99%. Then 99% of the data set will go to one
> reducer, and that reducer will become the bottleneck.
>
> Does Hadoop have any better ways to handle such an imbalanced data set?
>
> Jeff Zhang