Currently the flush can only flush part of the hashmap (under memory pressure). This makes sense (especially combined with a lfu replacement strategy). So based on this - the free slot management would be necessary.
If we can make the serialization fast enough - key serialization could a big win .. -----Original Message----- From: Zheng Shao [mailto:zsh...@gmail.com] Sent: Monday, February 02, 2009 7:36 PM To: hive-dev@hadoop.apache.org Subject: Re: better caching for UDF states in map-side group bys Yeah that will only work with a customized HashMap, but it's definitely possible to do. We might want to serialize the key to byte[] as well (using TBinarySortableProtocol etc). The free slot management does not seem to be a necessity, since when we flush all slots will be free. Zheng On Mon, Feb 2, 2009 at 6:04 PM, Joydeep Sen Sarma <jssa...@facebook.com>wrote: > So I had this thought - why not use arrays of primitive types to store UDF > state instead of objects. > > The background is that if one stores int [] intarray - then java uses 4 > bytes for each additional element in the array (verified). Instead if one > stores an array of objects that store an int - then there seems to be about > 16-20 bytes of extra overhead per object (not sure precisely - this is what > it seems on my limited experiments). > > So imagine that: > - we maintained states for UDFs in primitive arrays (this is the > UDFs responsibility) > - we had a customized HashMap implementation that stored an index > (int) as value for a key (keys are still objects - but values are just 4 > byte ints) > o looked at the jdk source - this seems straightforward > - to update state - we give the index to the evaluator. The > evaluator can then index into whatever arrays it maintains and do whatever > it wants. If it allocates a new slot in the array - then it can return the > allocated index back to the framework (to store against the key) > > this way - at least we can get rid of the object overhead from the value > part of the hashmap. > > Somewhat hacky (getting Java to work like C) - but this can be made to work > I think. > > There is the issue of managing the free slots on the array (which are > created on a flush) - but I think we can overlap a free list on top of the > primitive array (say every free slot stores the index of the next free slot. > When slots are freed - we can chain the new free slots into the existing > head of the free list). > > Thoughts? > -- Yours, Zheng