[ 
https://issues.apache.org/jira/browse/ACCUMULO-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15297313#comment-15297313
 ] 

Keith Turner commented on ACCUMULO-1124:
----------------------------------------

I experimented with shortening keys in the index. That gave some nice 
improvements, but not as much as I expected.  I realized that even with those 
changes, bad keys were still being placed in the index.  I added code to keep 
statistics on key sizes and used those statistics to try to select keys whose 
size was <= AVG(keySize).  I also excluded keys that were too big (more than 3 
standard deviations above the mean).  With the key shortening and statistics 
changes, I was able to reduce the index size for the file in my previous 
comment to the sizes shown below.

{noformat}
RFile Version            : 8

Locality group           : <DEFAULT>
        Num   blocks           : 21,758
        Index level 1          : 3,048 bytes  1 blocks
        Index level 0          : 1,873,885 bytes  8 blocks
        First key              : um:d:385:%03;%01;10.30.170.244>>o>/2954%af; data:current [] 4611686019157309597 false
        Last key               : um:d:395:%03;%01;%ff; com.facebook>.www>s>/dialog/feed?app_id=90376669494... TRUNCATED data:current [] -6917529026891043602 false
        Num entries            : 24,299,468
        Column families        : [data]

Meta block     : BCFile.index
      Raw size             : 4 bytes
      Compressed size      : 12 bytes
      Compression type     : gz

Meta block     : RFile.index
      Raw size             : 3,163 bytes
      Compressed size      : 1,515 bytes
      Compression type     : gz
{noformat}
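
The statistics-based selection described above could be sketched roughly as 
follows.  This is only an illustration in Python, not the actual RFile code: 
the class name KeySizeStats and the methods is_preferred and is_too_big are 
hypothetical, and it assumes "too big" means more than 3 standard deviations 
above the running mean.

```python
import math


class KeySizeStats:
    """Running mean and standard deviation of key sizes (Welford's method).

    Hypothetical helper for illustration; not part of Accumulo.
    """

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the running mean

    def add(self, size: int) -> None:
        """Record the size of one key written to the file."""
        self.count += 1
        delta = size - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (size - self.mean)

    @property
    def stddev(self) -> float:
        """Population standard deviation of the sizes seen so far."""
        return math.sqrt(self._m2 / self.count) if self.count else 0.0

    def is_preferred(self, size: int) -> bool:
        """A key no larger than the average size is a good index candidate."""
        return size <= self.mean

    def is_too_big(self, size: int) -> bool:
        """Assumed rule: exclude keys more than 3 std dev above the mean."""
        return size > self.mean + 3 * self.stddev
```

A writer could call add() for every key and, when a block is ready to close, 
consult is_preferred() and is_too_big() to decide whether the current key 
should become the index entry or whether to wait for a smaller one.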

At first I thought I could make these changes in 1.6 and 1.7.  However, while 
working on this I realized the key shortening change is a breaking change, in 
that older RFile code would not be able to handle keys in the index that do not 
exist in the data.  The changes to use statistics to choose better keys could 
still be made in 1.6 and 1.7.
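
To illustrate why a shortened index key may not exist in the data, here is a 
minimal sketch of one common shortening technique (a shortest separator 
between two adjacent keys).  It treats keys as plain byte strings, which is a 
simplification of real Accumulo keys, and the function name 
shortest_separator is hypothetical; the actual change may use a different 
algorithm.

```python
def shortest_separator(prev: bytes, nxt: bytes) -> bytes:
    """Return a short byte string s such that prev <= s < nxt.

    prev is the last key of one block and nxt is the first key of the
    following block.  Illustrative only; requires prev < nxt.
    """
    if not prev < nxt:
        raise ValueError("prev must sort strictly before nxt")

    # Find the length of the common prefix.
    limit = min(len(prev), len(nxt))
    i = 0
    while i < limit and prev[i] == nxt[i]:
        i += 1

    if i == len(prev):
        # prev is a prefix of nxt, so nothing shorter can be >= prev.
        return prev

    # Because prev < nxt, the first differing byte satisfies prev[i] < nxt[i].
    if prev[i] + 1 < nxt[i]:
        # Bump the differing byte and drop everything after it.
        return prev[:i] + bytes([prev[i] + 1])

    # No room to shorten with this simple rule; keep the full key.
    return prev
```

For example, with prev = b"row0015:cf" and nxt = b"row0042:cf" the result is 
b"row002", which sorts between the two keys but appears in neither block.  A 
reader that assumes every index key is also present in the data would 
mishandle such an entry.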

> optimize index size in RFile
> ----------------------------
>
>                 Key: ACCUMULO-1124
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-1124
>             Project: Accumulo
>          Issue Type: Improvement
>            Reporter: Eric Newton
>            Assignee: Keith Turner
>             Fix For: 1.8.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> I noticed HBASE-7845 and it seems like something we could do in RFile, too.
> Instead of putting the whole key in the index, you put in enough of the key 
> to get the reader to the beginning of the block.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
