Yeah - I am doing it with two MR jobs right now.

Understood the second solution. Is this what Pig uses internally (lazy -
should just look at the code)?

(One of the issues is that the optimal implementation requires
anticipating the group size. Easy to do with custom code, hard to do
automatically .. (would have to maintain approximate counts of distinct
values for each dimension))

-----Original Message-----
From: Ted Dunning [mailto:[EMAIL PROTECTED] 
Sent: Thursday, October 11, 2007 2:39 PM
To: [email protected]
Subject: Re: large reduce group sizes


First off, as in my previous mail, you don't need a special mechanism to
do the unique count.  Just add another MR step (which is, of course,
sorting its little heart out).
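To make the extra-MR-step idea concrete, here is a minimal Python simulation (not Hadoop API; the field names and the in-memory stand-in for the shuffle are illustrative only). Pass 1 keys on the (field1, field2) pair so the shuffle's sort/dedup does the "distinct"; pass 2 keys on field1 alone and counts:

```python
from itertools import groupby
from operator import itemgetter

def distinct_count_two_passes(events):
    """Simulate count(distinct field2) grouped by field1 as two MR passes."""
    # Pass 1: emit the (field1, field2) pair as the key; sorting plus
    # dedup stands in for the first job's shuffle + reduce, leaving one
    # record per distinct pair -- no reduce group is ever large.
    pairs = sorted(set((e["field1"], e["field2"]) for e in events))

    # Pass 2: group on field1 alone and count records per group.
    return {f1: sum(1 for _ in grp)
            for f1, grp in groupby(pairs, key=itemgetter(0))}

events = [
    {"field1": "a", "field2": 1},
    {"field1": "a", "field2": 1},
    {"field1": "a", "field2": 2},
    {"field1": "b", "field2": 1},
]
print(distinct_count_two_passes(events))  # {'a': 2, 'b': 1}
```

The point of the second job is exactly that it is "sorting its little heart out": the sort on the composite pair is what makes the dedup free.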

Secondly, you can definitely force the reduce iterator to give you
values in order.  The method is not very obvious, but there was a long
thread on it a week or so ago.

The summary outline is that you have to introduce the data you want to
sort on into the key.  Then you have to define both a sort order AND a
partition function.  The only problem is that you will only see ONE of
the keys, so you have to duplicate any data you want in the value.

It isn't hard. But it also definitely isn't obvious.
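A minimal sketch of that composite-key trick, again simulated in plain Python rather than the Hadoop API (partition count and record names are made up for illustration): sort on the full (group, value) key, but partition and form reduce groups on the group field alone, so each group's values arrive in sorted order.

```python
from itertools import groupby
from operator import itemgetter

NUM_PARTITIONS = 4

def partition(record):
    # Partition on the grouping field ONLY, so every composite key
    # sharing a group lands on the same reducer.
    group, _value = record
    return hash(group) % NUM_PARTITIONS

def shuffle_and_reduce(records):
    """Simulate a secondary sort over (group, value) composite keys."""
    results = {}
    for p in range(NUM_PARTITIONS):
        # The framework sorts on the whole composite key within a
        # partition...
        part = sorted(r for r in records if partition(r) == p)
        # ...but reduce groups are formed on the first component only,
        # so each group's values come out already in order.
        for group, grp in groupby(part, key=itemgetter(0)):
            results[group] = [value for _g, value in grp]
    return results

records = [("a", 3), ("b", 2), ("a", 1), ("a", 2), ("b", 1)]
print(shuffle_and_reduce(records))  # {'a': [1, 2, 3], 'b': [1, 2]}
```

In real Hadoop the three pieces map onto a custom partitioner, a sort comparator over the composite key, and a grouping comparator over the natural key; and since the reducer only sees one representative key per group, the sorted field has to be duplicated in the value, as above.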

I can't wait to get Pig out there so we don't have to know all of this.


On 10/11/07 2:32 PM, "Joydeep Sen Sarma" <[EMAIL PROTECTED]> wrote:

> great! Didn't realize that the iterator was disk based.
> 
> The below sounds very doable. Will give it a shot. Do you see this as
> an option in the mapred job (optionally sort values)?
> 
> -----Original Message-----
> From: Runping Qi [mailto:[EMAIL PROTECTED]
> Sent: Thursday, October 11, 2007 2:04 PM
> To: [email protected]
> Subject: RE: large reduce group sizes
> 
> 
> The values passed to reduce come from a disk-backed iterator.
> The problematic part is computing the distinct count.
> You have to keep the unique values in memory, or you have to use some
> other tricks.
> One such trick is sampling. The other is to write the values out to
> disk, do a merge sort, then read the sorted values in sequentially.
> It would be nice if somebody can contribute a patch.
> 
> Runping
>  
> 
>> -----Original Message-----
>> From: Joydeep Sen Sarma [mailto:[EMAIL PROTECTED]
>> Sent: Thursday, October 11, 2007 1:17 PM
>> To: [email protected]
>> Subject: large reduce group sizes
>> 
>> Hi all,
>> 
>> 
>> 
>> I am facing a problem with aggregations where reduce groups are
>> extremely large.
>> 
>> 
>> 
>> It's a very common usage scenario - for example someone might want
>> the equivalent of 'count(distinct e.field2) from events e group by
>> e.field1'. The natural thing to do is emit e.field1 as the map key and
>> do the distinct and count in the reduce.
>> 
>> 
>> 
>> Unfortunately, the values in the reduce phase all have to be pulled
>> into memory. And we end up running out of memory for large groups. It
>> would be great if the values iterator were able to seamlessly pull in
>> data from disk - especially since the data is coming from persistent
>> store.
>> 
>> 
>> 
>> I was wondering if other people have faced this problem - and what
>> they have done (some solutions have been suggested to me - like first
>> doing a group by on field1_hash(field2) to reduce the group size - but
>> they are a pain to implement). And how difficult would it be to have
>> an iterator iterate over on-disk - rather than in-memory - values?
>> 
>> 
>> 
>> Thx,
>> 
>> 
>> 
>> Joydeep
>> 
>> 
>> 
>> 
> 
> 
