better caching for UDF states in map-side group bys

Joydeep Sen Sarma Mon, 02 Feb 2009 18:05:17 -0800

So I had this thought - why not use arrays of primitive types to store UDF 
state instead of objects.


The background is that if one stores int [] intarray - then java uses 4 bytes 
for each additional element in the array (verified). Instead if one stores an 
array of objects that store an int - then there seems to be about 16-20 bytes 
of extra overhead per object (not sure precisely - this is what it seems on my 
limited experiments).

So imagine that:
-          we maintained states for UDFs in primitive arrays (this is the UDFs 
responsibility)
-          we had a customized HashMap implementation that stored an index 
(int) as value for a key (keys are still objects - but values are just 4 byte 
ints)
o        looked at the jdk source - this seems straightforward
-          to update state - we give the index to the evaluator. The evaluator 
can then index into whatever arrays it maintains and do whatever it wants. If 
it allocates a new slot in the array - then it can return the allocated index 
back to the framework (to store against the key)

this way - at least we can get rid of the object overhead from the value part 
of the hashmap.

Somewhat hacky (getting Java to work like C) - but this can be made to work I 
think.

There is the issue of managing the free slots on the array (which are created 
on a flush) - but I think we can overlap a free list on top of the primitive 
array (say every free slot stores the index of the next free slot. When slots 
are freed - we can chain the new free slots into the existing head of the free 
list).

Thoughts?

better caching for UDF states in map-side group bys

Reply via email to