Hi all,

I am facing a problem with aggregations where reduce groups are
extremely large. 

It's a very common usage scenario - for example, someone might want the
equivalent of 'select count(distinct e.field2) from events e group by
e.field1'. The natural thing to do is emit e.field1 as the map key and
do the distinct and count in the reduce.
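
For concreteness, here's a minimal sketch of that naive plan, using the
org.apache.hadoop.mapreduce API (the class names and the tab-separated
input layout are just assumptions for illustration):

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class NaiveDistinctCount {

  // Map: emit field1 as the key and field2 as the value.
  public static class EventsMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().split("\t");
      ctx.write(new Text(f[0]), new Text(f[1]));
    }
  }

  // Reduce: de-duplicate the field2 values for one field1 in memory,
  // then count. The HashSet is exactly what blows up for huge groups.
  public static class DistinctCountReducer
      extends Reducer<Text, Text, Text, IntWritable> {
    @Override
    protected void reduce(Text field1, Iterable<Text> field2s, Context ctx)
        throws IOException, InterruptedException {
      Set<String> distinct = new HashSet<String>();
      for (Text v : field2s) {
        distinct.add(v.toString());
      }
      ctx.write(field1, new IntWritable(distinct.size()));
    }
  }
}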

Unfortunately, the values in the reduce phase all have to be pulled
into memory, and we end up running out of memory for large groups. It
would be great if the values iterator could seamlessly pull in data
from disk - especially since the data is already coming from a
persistent store.

I was wondering whether other people have faced this problem and what
they have done. Some workarounds have been suggested to me - like first
doing a group by on field1 and hash(field2) to reduce the group size
(see the sketch below) - but they are a pain to implement. Also, how
difficult would it be to have the iterator run over on-disk - rather
than in-memory - values?
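
For reference, here's a rough sketch of that two-phase workaround as I
understand it (NUM_BUCKETS, the ctrl-A composite-key separator, and the
class names are all assumptions; the second job's trivial parsing
mapper is omitted):

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TwoPhaseDistinctCount {

  static final int NUM_BUCKETS = 64; // assumption: tune to the data

  // Job 1 map: key = field1 + bucket(field2), so each reduce group
  // sees only ~1/NUM_BUCKETS of the distinct field2 values.
  public static class BucketMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().split("\t");
      int bucket = (f[1].hashCode() & Integer.MAX_VALUE) % NUM_BUCKETS;
      ctx.write(new Text(f[0] + "\u0001" + bucket), new Text(f[1]));
    }
  }

  // Job 1 reduce: distinct-count within one (field1, bucket) group and
  // emit field1 -> partial count for the second job to sum.
  public static class BucketReducer
      extends Reducer<Text, Text, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> vals, Context ctx)
        throws IOException, InterruptedException {
      Set<String> distinct = new HashSet<String>();
      for (Text v : vals) {
        distinct.add(v.toString());
      }
      String field1 = key.toString().split("\u0001")[0];
      ctx.write(new Text(field1), new IntWritable(distinct.size()));
    }
  }

  // Job 2 reduce: sum the per-bucket distinct counts for each field1.
  // A given field2 value always hashes to the same bucket, so the
  // buckets partition the distinct values and the sum is exact.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text field1, Iterable<IntWritable> counts,
        Context ctx) throws IOException, InterruptedException {
      int total = 0;
      for (IntWritable c : counts) {
        total += c.get();
      }
      ctx.write(field1, new IntWritable(total));
    }
  }
}

So it works - but it's two chained jobs (plus composite-key parsing)
for what ought to be a one-line aggregation, which is why I'd much
rather have a values iterator that can spill to disk.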

Thx,

Joydeep