[
https://issues.apache.org/jira/browse/ACCUMULO-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15297313#comment-15297313
]
Keith Turner commented on ACCUMULO-1124:
----------------------------------------
I experimented with shortening keys in the index and that gave some nice
improvements, but not as much as I expected. I realized that even with those
changes, poorly chosen (large) keys were still being placed in the index. I
added code to keep statistics on key sizes and used those statistics to try to
select keys whose size was <= AVG(keySize). I also excluded keys that were too
big (more than 3 standard deviations above the mean). With the key shortening
and statistics changes I was able to reduce the index size for the file in my
previous comment to that below.
{noformat}
RFile Version : 8
Locality group : <DEFAULT>
Num blocks : 21,758
Index level 1 : 3,048 bytes 1 blocks
Index level 0 : 1,873,885 bytes 8 blocks
First key : um:d:385:%03;%01;10.30.170.244>>o>/2954%af; data:current [] 4611686019157309597 false
Last key : um:d:395:%03;%01;%ff; com.facebook>.www>s>/dialog/feed?app_id=90376669494... TRUNCATED data:current [] -6917529026891043602 false
Num entries : 24,299,468
Column families : [data]
Meta block : BCFile.index
Raw size : 4 bytes
Compressed size : 12 bytes
Compression type : gz
Meta block : RFile.index
Raw size : 3,163 bytes
Compressed size : 1,515 bytes
Compression type : gz
{noformat}
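To make the statistics idea concrete, here is a minimal sketch of how running key-size statistics could drive the choice of index keys. It is not the code from the patch: the class `KeySizeStats` and its methods are hypothetical names, key sizes are plain byte counts, and "too big" is assumed to mean more than 3 standard deviations above the mean.

```java
/**
 * Hypothetical helper (not part of Accumulo): tracks the running mean and
 * standard deviation of key sizes using Welford's online algorithm.
 */
final class KeySizeStats {
    private long count;
    private double mean;
    private double m2; // running sum of squared deviations from the mean

    /** Record the size, in bytes, of one key written to the file. */
    void add(int keySize) {
        count++;
        double delta = keySize - mean;
        mean += delta / count;
        m2 += delta * (keySize - mean);
    }

    double mean() {
        return mean;
    }

    /** Population standard deviation of the sizes seen so far. */
    double stdDev() {
        return count > 1 ? Math.sqrt(m2 / count) : 0.0;
    }

    /** Preferred index candidates are no larger than the average key. */
    boolean isPreferred(int keySize) {
        return keySize <= mean;
    }

    /** Assumption: "too big" means more than 3 std dev above the mean. */
    boolean isTooBig(int keySize) {
        return keySize > mean + 3 * stdDev();
    }
}
```

A writer using something like this would call `add` for every key and, when a data block is nearly full, prefer to end the block on a key for which `isPreferred` is true and never on one for which `isTooBig` is true.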
At first I thought I could make these changes in 1.6 and 1.7. However, while
working on this I realized the key shortening change is a breaking change, in
that older RFile code would not be able to handle keys in the index that do not
exist in the data. The changes to use statistics to choose better keys could
still be made in 1.6 and 1.7.
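For readers unfamiliar with key shortening, the following sketch shows the general idea on raw byte arrays and why it can produce an index key that is not in the data. It is an illustration only: `IndexKeyShortener` and `shorten` are hypothetical names, real RFile keys have row, column and timestamp fields that this ignores, and the input is assumed to satisfy prevLast < firstInBlock in unsigned byte order.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical illustration of index key shortening; not Accumulo code. */
final class IndexKeyShortener {

    /**
     * Returns the shortest prefix of firstInBlock that still sorts strictly
     * after prevLast. Assumes prevLast sorts strictly before firstInBlock.
     */
    static byte[] shorten(byte[] prevLast, byte[] firstInBlock) {
        int limit = Math.min(prevLast.length, firstInBlock.length);
        int common = 0;
        // Find the length of the shared prefix.
        while (common < limit && prevLast[common] == firstInBlock[common]) {
            common++;
        }
        // One byte past the shared prefix is enough to separate the two keys.
        int len = Math.min(common + 1, firstInBlock.length);
        return Arrays.copyOf(firstInBlock, len);
    }

    /** Convenience wrapper for experimenting with text keys. */
    static String shorten(String prevLast, String firstInBlock) {
        byte[] out = shorten(prevLast.getBytes(StandardCharsets.UTF_8),
                             firstInBlock.getBytes(StandardCharsets.UTF_8));
        return new String(out, StandardCharsets.UTF_8);
    }
}
```

With `prevLast = "row100:cf"` and `firstInBlock = "row101:cf:long"`, the sketch yields `"row101"`, which separates the two blocks but does not appear anywhere in the data. A reader that expects every index key to be a real key from the file is exactly the case that makes this change incompatible with older code.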
> optimize index size in RFile
> ----------------------------
>
> Key: ACCUMULO-1124
> URL: https://issues.apache.org/jira/browse/ACCUMULO-1124
> Project: Accumulo
> Issue Type: Improvement
> Reporter: Eric Newton
> Assignee: Keith Turner
> Fix For: 1.8.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> I noticed HBASE-7845 and it seems like something we could do in RFile, too.
> Instead of putting the whole key in the index, you put in enough of the key
> to get the reader to the beginning of the block.