[ 
https://issues.apache.org/jira/browse/HBASE-3551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005272#comment-13005272
 ] 

Marc Limotte commented on HBASE-3551:
-------------------------------------

I understand this better now.  I did some poking around with the HFile tool.  
Average key length does seem to be around 150 bytes, as I estimated.
 
For one HFile,
/hbase/foo/fb820ae7002fc96f78165802a0b05e63/metrics/14129209576094096,
the metadata is:

avgKeyLen=159, avgValueLen=7, entries=49285512, length=615516343
fileinfoOffset=592314718, dataIndexOffset=592315104, dataIndexCount=131869, 
metaIndexOffset=0, metaIndexCount=0, totalBytes=8653853680, 
entryCount=49285512, version=1

Size of index = length - dataIndexOffset = 615516343 - 592315104
= 23201239 bytes (about 22 MB)

Index data per region server = 22 MB * 180 regions = almost 4 GB.  Plus the
other column family, so this does seem to add up to the 5 to 6 GB of heap we
are seeing.

# of entries per dataindex entry = 49285512 / 131869 = 374
Times the key size (avg 159 bytes for this file) = 59k (close to the block size 
of 64k).  So, this seems to make sense.
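
The arithmetic above can be reproduced with a short script.  This is only a
sketch: the numbers are copied from the metadata output, the 180-region count
comes from this comment, and treating every region as carrying an index of
about the same size is a simplifying assumption.

```python
# Figures copied from the HFile metadata shown above.
length = 615_516_343             # total file length in bytes
data_index_offset = 592_315_104  # offset where the data block index starts
data_index_count = 131_869       # number of data index entries
entries = 49_285_512             # KeyValue entries in the file
avg_key_len = 159                # avgKeyLen from the metadata

# Assumption: 180 regions per region server, each with an index
# of roughly this size for this column family.
regions_per_server = 180

# Everything from the start of the data index to the end of the file.
index_bytes = length - data_index_offset
index_mib = index_bytes / 2**20
per_server_gib = index_bytes * regions_per_server / 2**30

# How many entries one index entry covers, and how many key bytes that is.
entries_per_index_entry = entries / data_index_count
key_bytes_per_index_entry = entries_per_index_entry * avg_key_len

print(f"index size: {index_bytes} bytes ({index_mib:.1f} MiB)")
print(f"index per region server: {per_server_gib:.2f} GiB")
print(f"entries per index entry: {entries_per_index_entry:.0f}")
print(f"key bytes per index entry: {key_bytes_per_index_entry:.0f}")
```

The key bytes alone come out a little under the 64k block size, which fits
with each index entry covering roughly one block.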

I also looked at the keyvalue pairs using the HFile tool (a section of output 
is below).

We have a few billion rows (2 - 4 billion).  I haven't done a full row count.

What I didn't understand previously is that it's not 374 rows, but 374 
"entries".  An entry is a single column value, and the full key is repeated 
for each column value.  Given our fairly large key, that adds up quickly.
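
To make the rows-versus-entries point concrete, here is a rough sketch of how
key bytes grow with the number of columns in a row.  The row key and the
column qualifiers are hypothetical, and key_len() is a simplified model of
the stored key layout (row, family, qualifier, timestamp, type, plus length
fields), not something read from our files.

```python
def key_len(row: bytes, family: bytes, qualifier: bytes) -> int:
    """Approximate stored key size for one entry (simplified model)."""
    # 2-byte row length + row + 1-byte family length + family
    # + qualifier + 8-byte timestamp + 1-byte key type
    return 2 + len(row) + 1 + len(family) + len(qualifier) + 8 + 1

row = b"x" * 120     # hypothetical 120-byte string row key
family = b"metrics"  # column family name taken from the file path
# Hypothetical column qualifiers for a single row.
qualifiers = [b"clicks", b"impressions", b"conversions", b"revenue"]

# One entry per column value, each carrying its own copy of the row key.
per_entry = [key_len(row, family, q) for q in qualifiers]
total_key_bytes = sum(per_entry)
row_key_bytes_repeated = len(row) * len(qualifiers)

print(f"entries for this row: {len(per_entry)}")
print(f"key bytes per entry: {per_entry}")
print(f"total key bytes: {total_key_bytes}")
print(f"of which repeated row key: {row_key_bytes_repeated}")
```

In this model one logical row with four columns is four entries, and the row
key is stored four times.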

Solutions
1) Increase the HBase block size (I did this and it resolved our situation for 
now).
2) Modify our schema to use smaller keys, perhaps IDs instead of string 
names.
3) Modify our schema to have fewer columns; we could combine several 
related columns into one compound value.
4) Add an LRU cache for storefile indexes.
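
As a rough illustration of why options 1 and 2 help, the sketch below models
on-disk index size as (number of blocks) * (key length + per-entry overhead).
The 17-byte overhead is simply what is left over when the observed index size
is divided by dataIndexCount and the average key length is subtracted; the
256k block size and 40-byte key are hypothetical values for comparison, not
settings we measured.  Heap use of a loaded index may differ from these
on-disk figures.

```python
import math

def estimate_index_bytes(total_bytes, block_size, avg_key_len, overhead):
    """Rough on-disk index size: one index entry per data block."""
    blocks = math.ceil(total_bytes / block_size)
    return blocks * (avg_key_len + overhead)

# Figures from the metadata above.
total_bytes = 8_653_853_680   # totalBytes (uncompressed data)
observed_index = 23_201_239   # length - dataIndexOffset
data_index_count = 131_869
avg_key_len = 159

# Empirical per-entry overhead (offset, size, and so on), rounded.
overhead = round(observed_index / data_index_count - avg_key_len)

baseline = estimate_index_bytes(total_bytes, 64 * 1024, avg_key_len, overhead)
# Hypothetical alternatives: 4x larger blocks, or much shorter keys.
bigger_blocks = estimate_index_bytes(total_bytes, 256 * 1024, avg_key_len, overhead)
shorter_keys = estimate_index_bytes(total_bytes, 64 * 1024, 40, overhead)

print(f"64k blocks, 159-byte keys:  {baseline / 2**20:.1f} MiB")
print(f"256k blocks, 159-byte keys: {bigger_blocks / 2**20:.1f} MiB")
print(f"64k blocks, 40-byte keys:   {shorter_keys / 2**20:.1f} MiB")
```

The model lands close to the observed index size for the current settings,
and the index shrinks roughly in proportion to the block size increase.  The
usual trade-off is that larger blocks mean more data read per random lookup.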

Given the other options, #4 may not be warranted, so I think we can close this 
issue.


> Loaded hfile indexes occupy a good chunk of heap; look into shrinking the 
> amount used and/or evicting unused indices
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-3551
>                 URL: https://issues.apache.org/jira/browse/HBASE-3551
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: stack
>
> I hung with a user Marc and we were looking over configs and his cluster 
> profile up on ec2.  One thing we noticed was that his 100+ 1G regions of two 
> families had ~2.5G of heap resident.  We did a bit of math and couldn't get 
> to 2.5G so that needs looking into.  Even still, 2.5G is a bunch of heap to 
> give over to indices (He actually OOME'd when he had his RS heap set to just 
> 3G; we shouldn't OOME, we should just run slower).  It sounds like he needs 
> the indices loaded but still, for some cases we should drop indices for 
> unaccessed files.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
