[
https://issues.apache.org/jira/browse/ACCUMULO-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15308840#comment-15308840
]
Keith Turner commented on ACCUMULO-4314:
----------------------------------------
I ran a test with the 1.7 changes for this issue, using the same file I had
used to test the changes for ACCUMULO-1124. The total index size went from
6.9M to 3.6M.
{noformat}
$ accumulo rfile-info /accumulo/tables/2/default_tablet/A0000005.rf
Reading file:
hdfs://localhost:10000/accumulo/tables/2/default_tablet/A0000005.rf
Locality group : <DEFAULT>
Start block : 0
Num blocks : 20,041
Index level 1 : 4,140 bytes 1 blocks
Index level 0 : 3,620,079 bytes 14 blocks
First key : um:d:385:%03;%01;10.30.170.244>>o>/2954%af;
data:current [] 4611686019157309597 false
Last key : um:d:395:%03;%01;%ff;
com.facebook>.www>s>/dialog/feed?app_id=90376669494... TRUNCATED data:current
[] -6917529026891043602 false
Num entries : 24,299,468
Column families : [data]
Meta block : BCFile.index
Raw size : 4 bytes
Compressed size : 12 bytes
Compression type : gz
Meta block : RFile.index
Raw size : 4,258 bytes
Compressed size : 2,154 bytes
Compression type : gz
{noformat}
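For a rough sense of scale, the figures above can be combined with a little arithmetic. This is only a back-of-the-envelope sketch: the variable names are made up for illustration, the "before" size is the rounded 6.9M quoted above, and dividing the index size by the block count assumes roughly one level 0 index entry per data block.

```python
# Figures copied from the rfile-info output above.
index_level_sizes = {1: 4_140, 0: 3_620_079}  # bytes per index level
num_data_blocks = 20_041

total_index_bytes = sum(index_level_sizes.values())

# Assumption: about one level 0 index entry per data block, so this is
# only an approximate per-entry cost (key plus whatever else is stored).
bytes_per_block = total_index_bytes / num_data_blocks

# Rounded sizes quoted in the comment; both use the same unit ("M"),
# so only their ratio is meaningful here.
before_m, after_m = 6.9, 3.6
reduction = 1 - after_m / before_m

print(f"total index: {total_index_bytes:,} bytes")
print(f"index bytes per data block: {bytes_per_block:.1f}")
print(f"reduction: {reduction:.0%}")
```

Under those assumptions this works out to roughly 181 index bytes per data block and a reduction of a bit under half.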
> Use statistics to choose better keys for RFile index
> ----------------------------------------------------
>
> Key: ACCUMULO-4314
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4314
> Project: Accumulo
> Issue Type: Improvement
> Reporter: Keith Turner
> Assignee: Keith Turner
> Priority: Blocker
> Fix For: 1.6.6, 1.7.2
>
>
> The commit for ACCUMULO-1124 makes two changes:
> * Generates shorter keys, which may not exist in the data, to place in the RFile index.
> * Uses statistics to make better choices about which keys to place in the index.
> The statistics-based change looks for keys of average size or below and excludes
> large keys (keys more than 3 std dev above average).
> The change to generate shorter keys cannot be made in 1.7.X and 1.6.X
> because it would generate RFiles that may not work properly with older 1.6
> and 1.7 versions. However, the change to use statistics to pick better keys
> could be made in 1.6 and 1.7.
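The statistics-based selection described above could be sketched roughly as follows. This is not the Accumulo implementation, only an illustration of the idea under stated assumptions: `pick_index_key`, `candidates`, and `size_history` are hypothetical names, keys are modeled as plain `bytes`, and the fallback order when no short key is available is a guess.

```python
from statistics import fmean, pstdev


def pick_index_key(candidates, size_history, max_std_dev=3.0):
    """Pick one candidate key to place in the index.

    candidates   -- keys (bytes) that could serve as the index entry
    size_history -- sizes of keys written so far, used for statistics
    max_std_dev  -- keys larger than mean + max_std_dev * stddev are "large"
    """
    if not candidates:
        raise ValueError("need at least one candidate key")
    if not size_history:
        # No statistics yet, so there is nothing to compare against.
        return candidates[0]

    mean = fmean(size_history)
    large_cutoff = mean + max_std_dev * pstdev(size_history)

    # Preferred: the first candidate whose size is average or below.
    for key in candidates:
        if len(key) <= mean:
            return key

    # Otherwise: the first candidate that is not considered large.
    for key in candidates:
        if len(key) <= large_cutoff:
            return key

    # Every candidate is large; fall back to the shortest one.
    return min(candidates, key=len)


# Hypothetical data: most keys are 10-13 bytes, one candidate is 200 bytes.
size_history = [10, 12, 11, 13, 12, 10, 11, 12]
candidates = [b"x" * 200, b"row_0042", b"row_0043:cf"]
print(pick_index_key(candidates, size_history))
```

With this sample data the 200-byte key is skipped in favor of the first below-average key, which is the kind of choice that keeps a few very long keys from inflating the index.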