[
https://issues.apache.org/jira/browse/ACCUMULO-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15285234#comment-15285234
]
Keith Turner commented on ACCUMULO-1124:
----------------------------------------
I ran Fluo's Webindex example on EC2 for a long period and afterward inspected
some of the RFiles. Some of them had larger indexes than I expected. I suspect
the change described in this ticket (storing shortened keys in the index) would
reduce the index size.
Below is the rfile-info output for an RFile whose keys contain URLs from web
pages. I am going to experiment with generating shorter keys in the index for
this file. This file was generated using 64K data blocks and 256K index blocks.
{noformat}
[centos@leader1 ~]$ accumulo rfile-info --histogram /accumulo/tables/7/t-0003uq7/A000rxoi.rf
2016-05-16 16:48:38,914 [rfile.PrintInfo] WARN : Attempting to find file across filesystems. Consider providing URI instead of path
Reading file: hdfs://leader1:10000/accumulo/tables/7/t-0003uq7/A000rxoi.rf
Locality group : notify
Start block : 0
Num blocks : 0
Index level 0 : 0 bytes 1 blocks
First key : null
Last key : null
Num entries : 0
Column families : [ntfy]
Locality group : <DEFAULT>
Start block : 0
Num blocks : 21,818
Index level 3 : 120,581 bytes 1 blocks
Index level 2 : 451,008 bytes 2 blocks
Index level 1 : 714,687 bytes 3 blocks
Index level 0 : 6,915,137 bytes 25 blocks
First key : um:d:385:%03;%01;10.30.170.244>>o>/2954%af; data:current [] 4611686019157309597 false
Last key : um:d:395:%03;%01;%ff; com.facebook>.www>s>/dialog/feed?app_id=90376669494... TRUNCATED data:current [] -6917529026891043602 false
Num entries : 24,299,468
Column families : [data]
Meta block : BCFile.index
Raw size : 4 bytes
Compressed size : 12 bytes
Compression type : gz
Meta block : RFile.index
Raw size : 120,754 bytes
Compressed size : 21,719 bytes
Compression type : gz
Up to size        count      %-age
         10 :    9292962    22.56%
        100 :   14947371    74.88%
       1000 :      59017     2.45%
      10000 :        112     0.07%
     100000 :          6     0.04%
    1000000 :          0     0.00%
   10000000 :          0     0.00%
  100000000 :          0     0.00%
 1000000000 :          0     0.00%
10000000000 :          0     0.00%
{noformat}
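As a rough illustration of the idea in this ticket (keep only enough of the key in the index to route a reader to the right block), here is a minimal Python sketch. It is not Accumulo code. It assumes keys are flat byte strings compared lexicographically, whereas real RFile keys have row, column, and timestamp parts. It also assumes each index entry only needs to sort at or after the last key of its block and strictly before the first key of the next block. The name shorten_index_key is hypothetical.

```python
def shorten_index_key(last_key: bytes, next_first_key: bytes) -> bytes:
    """Return a key k with last_key <= k < next_first_key that is no longer
    than last_key. Illustrative sketch only; keys are plain byte strings."""
    if not last_key < next_first_key:
        raise ValueError("last_key must sort strictly before next_first_key")

    # Length of the common prefix of the two keys.
    i = 0
    limit = min(len(last_key), len(next_first_key))
    while i < limit and last_key[i] == next_first_key[i]:
        i += 1

    # last_key is a prefix of next_first_key: nothing shorter sorts >= last_key.
    if i == len(last_key):
        return last_key

    # Truncate after position j and bump that byte by one. The result always
    # sorts after last_key; keep the first one that still sorts before the
    # next block's first key.
    for j in range(i, len(last_key)):
        if last_key[j] != 0xFF:
            candidate = last_key[:j] + bytes([last_key[j] + 1])
            if candidate < next_first_key:
                return candidate

    # No shorter separator found; fall back to the full key.
    return last_key


if __name__ == "__main__":
    # Hypothetical keys, loosely modeled on URL-style rows.
    print(shorten_index_key(b"com.example/page/aaaa", b"com.facebook/x"))
```

With long URL-style keys that share a prefix and then diverge, the stored index key shrinks to roughly the common prefix plus one byte, which is where the savings for a file like the one above would be expected to come from.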
> optimize index size in RFile
> ----------------------------
>
> Key: ACCUMULO-1124
> URL: https://issues.apache.org/jira/browse/ACCUMULO-1124
> Project: Accumulo
> Issue Type: Improvement
> Reporter: Eric Newton
> Assignee: Keith Turner
>
> I noticed HBASE-7845 and it seems like something we could do in RFile, too.
> Instead of putting the whole key in the index, you put in enough of the key
> to get the reader to the beginning of the block.