Re: Size of KeyValue

lars hofhansl Thu, 01 Dec 2011 10:58:17 -0800

To try this out I changed the server side code to keep track of the latest KV 
rather than the row, and also remove the caching from KV.
The difference for GC and runtime is within the noise. I have to conclude that 
allocating KVs is just not a big problem compared to other garbage being
produced during scans.



----- Original Message -----
From: Todd Lipcon <[email protected]>
To: [email protected]
Cc: Stack <[email protected]>
Sent: Tuesday, November 29, 2011 11:25 PM
Subject: Re: Size of KeyValue

On Mon, Nov 28, 2011 at 8:05 PM, Lars <[email protected]> wrote:
> Hmm, interesting. It's used (among others) in server side scanners to hold 
> the current row. Could just keep a reference to the KeyValue around instead. 
> Need to make we don't hold on to the current blocks buffer forever, though.
>

Yep - another option is for the scanner to have its own byte[] where
it copies the current row key into - since it's already doing
comparisons with it for every advance of the scanner, it should be in
L2 cache if not L1, and the copy would be minimally expensive.

-Todd

> Todd Lipcon <[email protected]> schrieb:
>
>>If I recall correctly, we put this in more for the benefit of the
>>client side, with the assumption that the server side would never call
>>this API. Then, we ended up writing some bad code somewhere in the
>>server which calls this function.
>>
>>-Todd
>>
>>On Mon, Nov 28, 2011 at 5:42 PM, lars hofhansl <[email protected]> wrote:
>>> Did some (unscientific) tests...
>>> The test scenario is 1m (wide) rows (50m cells) and then scanning along all 
>>> cells.
>>>
>>> The difference in runtime is within the noise. When I measure GC stats with 
>>> jstat I see a ~3% reduction is young collections, and a ~10% reduction in 
>>> overall GC time.
>>>
>>> At the same time I set a counting breakpoint on KeyValue.getRow that fires 
>>> when rowcache is found not-null. I found this triggered about every 16 key 
>>> values, which suggests
>>> the optimization saves a lot of copying of the row key.
>>>
>>> It is not entirely clear under what circumstances the rowCache would a be 
>>> win, outweighing the extra static memory by every KV.
>>>
>>> So it looks like it is not worth making the change, although I suppose 
>>> anything reducing GC pressure is a win.
>>>
>>> -- Lars
>>>
>>>
>>> ----- Original Message -----
>>> From: lars hofhansl <[email protected]>
>>> To: Stack <[email protected]>; "[email protected]" <[email protected]>
>>> Cc:
>>> Sent: Thursday, November 24, 2011 11:57 AM
>>> Subject: Re: Size of KeyValue
>>>
>>> Hmm... Might be hard to prove whether removing that would be a net win or 
>>> net loss in the current code base.
>>> I'll do some tests and report back.
>>>
>>>
>>>
>>> ________________________________
>>> From: Stack <[email protected]>
>>> To: [email protected]; lars hofhansl <[email protected]>
>>> Sent: Wednesday, November 23, 2011 8:41 PM
>>> Subject: Re: Size of KeyValue
>>>
>>> On Wed, Nov 23, 2011 at 3:40 PM, lars hofhansl <[email protected]> wrote:
>>>> Looking at KeyValue I see three variable purely used for caching:
>>>> timestampCache(long), rowCache(byte[]), and keyLength(int).
>>>>
>>>> From a quick glance over the code I do not see many spots where we 
>>>> repeatedly get the TS, rowKey, of keyLength from the same KV.
>>>> Together these consume 24 bytes (almost 1/2 of KeyValue's constant memory 
>>>> overhead) on every key value created, and we create
>>>> a *lot* KVs (real and "fake" ones) during scanning and seeking.
>>>>
>>>> Were these added to address specific performance concerns? If not, we 
>>>> might consider removing these.
>>>>
>>>
>>> IIRC, I added them after watching stuff in a profiler (a long time
>>> ago).  Things change.  Thats a lot of static mem to give up.
>>>
>>> St.Ack
>>>
>>
>>
>>
>>--
>>Todd Lipcon
>>Software Engineer, Cloudera
>



-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Size of KeyValue

Reply via email to