[ https://issues.apache.org/jira/browse/ACCUMULO-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15298561#comment-15298561 ]
Keith Turner commented on ACCUMULO-1124:
----------------------------------------
I pushed a commit to the PR that adds a {{--keyStats}} option to rfile-info.
Below is the output of running this command on the original file. It shows that
the 6 largest keys all ended up in the index, and that the average key size in
the index (281.73) is over twice that of the data (122.99).
{noformat}
$ accumulo rfile-info --keyStats ~/A000rxoi.rf
Reading file: file:/home/fluo/A000rxoi.rf
RFile Version : 7
Locality group : notify
Start block : 0
Num blocks : 0
Index level 0 : 0 bytes 1 blocks
First key : null
Last key : null
Num entries : 0
Column families : [ntfy]
Locality group : <DEFAULT>
Start block : 0
Num blocks : 21,818
Index level 3 : 120,581 bytes 1 blocks
Index level 2 : 451,008 bytes 2 blocks
Index level 1 : 714,687 bytes 3 blocks
Index level 0 : 6,915,137 bytes 25 blocks
First key : um:d:385:%03;%01;10.30.170.244>>o>/2954%af; data:current [] 4611686019157309597 false
Last key : um:d:395:%03;%01;%ff; com.facebook>.www>s>/dialog/feed?app_id=90376669494... TRUNCATED data:current [] -6917529026891043602 false
Num entries : 24,299,468
Column families : [data]
Meta block : BCFile.index
Raw size : 4 bytes
Compressed size : 12 bytes
Compression type : gz
Meta block : RFile.index
Raw size : 120,754 bytes
Compressed size : 21,719 bytes
Compression type : gz
Statistics for keys in data :
Up to size count %-age
10 : 10768926 26.51%
100 : 13471699 70.82%
1000 : 58725 2.56%
10000 : 112 0.07%
100000 : 6 0.04%
1000000 : 0 0.00%
10000000 : 0 0.00%
100000000 : 0 0.00%
1000000000 : 0 0.00%
10000000000 : 0 0.00%
min: 31.00 max: 330,380.00 avg: 122.99 stddev: 157.51
Statistics for keys in index :
Up to size count %-age
10 : 6192 7.67%
100 : 15024 49.96%
1000 : 578 13.21%
10000 : 18 8.83%
100000 : 6 20.33%
1000000 : 0 0.00%
10000000 : 0 0.00%
100000000 : 0 0.00%
1000000000 : 0 0.00%
10000000000 : 0 0.00%
min: 36.00 max: 330,380.00 avg: 281.73 stddev: 3,901.56
$
{noformat}
Below is the output of running this command on a file compacted using the code
in the PR. None of the largest keys are in the index, and the average key size
in the index (58.49) is less than half of that in the data (122.99).
{noformat}
$ accumulo rfile-info --keyStats /accumulo/tables/2/default_tablet/A0000005.rf
Reading file: hdfs://localhost:10000/accumulo/tables/2/default_tablet/A0000005.rf
RFile Version : 8
Locality group : <DEFAULT>
Num blocks : 21,758
Index level 1 : 3,048 bytes 1 blocks
Index level 0 : 1,873,885 bytes 8 blocks
First key : um:d:385:%03;%01;10.30.170.244>>o>/2954%af; data:current [] 4611686019157309597 false
Last key : um:d:395:%03;%01;%ff; com.facebook>.www>s>/dialog/feed?app_id=90376669494... TRUNCATED data:current [] -6917529026891043602 false
Num entries : 24,299,468
Column families : [data]
Meta block : BCFile.index
Raw size : 4 bytes
Compressed size : 12 bytes
Compression type : gz
Meta block : RFile.index
Raw size : 3,163 bytes
Compressed size : 1,515 bytes
Compression type : gz
Statistics for keys in data :
Up to size count %-age
10 : 10768926 26.51%
100 : 13471699 70.82%
1000 : 58725 2.56%
10000 : 112 0.07%
100000 : 6 0.04%
1000000 : 0 0.00%
10000000 : 0 0.00%
100000000 : 0 0.00%
1000000000 : 0 0.00%
10000000000 : 0 0.00%
min: 31.00 max: 330,380.00 avg: 122.99 stddev: 157.51
Statistics for keys in index :
Up to size count %-age
10 : 18153 68.40%
100 : 3602 31.43%
1000 : 1 0.17%
10000 : 0 0.00%
100000 : 0 0.00%
1000000 : 0 0.00%
10000000 : 0 0.00%
100000000 : 0 0.00%
1000000000 : 0 0.00%
10000000000 : 0 0.00%
min: 9.00 max: 2,134.00 avg: 58.49 stddev: 36.23
$
{noformat}
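A note on reading the key statistics tables. The column header says "Up to
size", but the numbers suggest otherwise: the minimum key size is 31 while
millions of keys sit in the 10 row, so each row seems to cover sizes from its
label up to the next label. The %-age column looks like a share of total key
bytes rather than of key count, since 6 keys account for 20.33% of the original
index. Below is a minimal sketch under those assumptions. It is not the actual
rfile-info code, and the class and method names are hypothetical.

```java
// Hypothetical sketch of a key-size histogram; not the rfile-info implementation.
final class KeySizeStats {
  // Row b is assumed to cover sizes in [10^b, 10^(b+1)); row 0 also takes sizes below 10.
  final long[] counts = new long[11];
  final long[] bytes = new long[11];
  long entries = 0;
  long totalBytes = 0;
  long min = Long.MAX_VALUE;
  long max = 0;

  // Number of times size can be divided by 10, i.e. floor(log10(size)) for size >= 1.
  static int bucket(long size) {
    int b = 0;
    for (long s = size; s >= 10; s /= 10) {
      b++;
    }
    return Math.min(b, 10); // clamp so very large sizes land in the last row
  }

  void add(long size) {
    int b = bucket(size);
    counts[b]++;
    bytes[b] += size;
    entries++;
    totalBytes += size;
    min = Math.min(min, size);
    max = Math.max(max, size);
  }

  double average() {
    return (double) totalBytes / entries;
  }

  // Share of all key bytes contributed by row b, as a percentage.
  double percentOfBytes(int b) {
    return 100.0 * bytes[b] / totalBytes;
  }
}
```

Feeding this three small keys and one 330,380 byte key puts nearly all of the
bytes in a single row, which is the same effect the 6 largest keys have on the
original index above.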
> optimize index size in RFile
> ----------------------------
>
> Key: ACCUMULO-1124
> URL: https://issues.apache.org/jira/browse/ACCUMULO-1124
> Project: Accumulo
> Issue Type: Improvement
> Reporter: Eric Newton
> Assignee: Keith Turner
> Fix For: 1.8.0
>
> Time Spent: 2h
> Remaining Estimate: 0h
>
> I noticed HBASE-7845 and it seems like something we could do in RFile, too.
> Instead of putting the whole key in the index, you put in enough of the key
> to get the reader to the beginning of the block.
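The idea in the description can be sketched as follows. This is a minimal
illustration on flat byte arrays, not the code in the PR. It assumes an index
entry only needs to sort at or after the last key of its block and strictly
before the first key of the next block. Real Accumulo keys have several fields
(row, column, timestamp), which this ignores. The class and method names are
hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch of shortening an index key; not the RFile implementation.
final class IndexKeySketch {

  // Lexicographic comparison that treats bytes as unsigned values.
  static int compare(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int x = a[i] & 0xff;
      int y = b[i] & 0xff;
      if (x != y) {
        return x - y;
      }
    }
    return a.length - b.length;
  }

  // Returns s such that lastOfBlock <= s < firstOfNext and s is never longer than lastOfBlock.
  static byte[] shorten(byte[] lastOfBlock, byte[] firstOfNext) {
    if (compare(lastOfBlock, firstOfNext) >= 0) {
      throw new IllegalArgumentException("keys must be strictly increasing");
    }
    int n = Math.min(lastOfBlock.length, firstOfNext.length);
    int i = 0;
    while (i < n && lastOfBlock[i] == firstOfNext[i]) {
      i++;
    }
    if (i == lastOfBlock.length) {
      // lastOfBlock is a prefix of firstOfNext; nothing shorter fits between them.
      return lastOfBlock;
    }
    if (firstOfNext.length > i + 1) {
      // A proper prefix of firstOfNext that already differs from lastOfBlock at byte i.
      return Arrays.copyOf(firstOfNext, i + 1);
    }
    if ((lastOfBlock[i] & 0xff) + 1 < (firstOfNext[i] & 0xff)) {
      // Room between the differing bytes: keep the common prefix and bump byte i.
      byte[] s = Arrays.copyOf(lastOfBlock, i + 1);
      s[i]++;
      return s;
    }
    return lastOfBlock; // no shorter separator found by these simple rules
  }
}
```

For example, with a last key of um:d:385:aaa and a next key of um:d:385:zzz,
shorten returns um:d:385:z. A very long last key can be replaced the same way
whenever the next key diverges from it early.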