[ 
https://issues.apache.org/jira/browse/ACCUMULO-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15285234#comment-15285234
 ] 

Keith Turner commented on ACCUMULO-1124:
----------------------------------------

I ran Fluo's Webindex example on EC2 for a long period and afterwards 
inspected some of the resulting RFiles.  Some of them had larger indexes than 
I expected.  I suspect that making the change described in this ticket, 
storing shortened keys in the index, would reduce the index size.

Below is the rfile-info output for an RFile whose keys contain URLs from web 
pages.  I am going to experiment with generating shorter keys in the index for 
this file.  The file was generated using 64K data blocks and 256K index blocks.

{noformat}
[centos@leader1 ~]$ accumulo rfile-info --histogram /accumulo/tables/7/t-0003uq7/A000rxoi.rf
2016-05-16 16:48:38,914 [rfile.PrintInfo] WARN : Attempting to find file across filesystems. Consider providing URI instead of path
Reading file: hdfs://leader1:10000/accumulo/tables/7/t-0003uq7/A000rxoi.rf
Locality group         : notify
        Start block          : 0
        Num   blocks         : 0
        Index level 0        : 0 bytes  1 blocks
        First key            : null
        Last key             : null
        Num entries          : 0
        Column families      : [ntfy]
Locality group         : <DEFAULT>
        Start block          : 0
        Num   blocks         : 21,818
        Index level 3        : 120,581 bytes  1 blocks
        Index level 2        : 451,008 bytes  2 blocks
        Index level 1        : 714,687 bytes  3 blocks
        Index level 0        : 6,915,137 bytes  25 blocks
        First key            : um:d:385:%03;%01;10.30.170.244>>o>/2954%af; data:current [] 4611686019157309597 false
        Last key             : um:d:395:%03;%01;%ff; com.facebook>.www>s>/dialog/feed?app_id=90376669494... TRUNCATED data:current [] -6917529026891043602 false
        Num entries          : 24,299,468
        Column families      : [data]

Meta block     : BCFile.index
      Raw size             : 4 bytes
      Compressed size      : 12 bytes
      Compression type     : gz

Meta block     : RFile.index
      Raw size             : 120,754 bytes
      Compressed size      : 21,719 bytes
      Compression type     : gz


Up to size      count      %-age
         10 :    9292962  22.56%
        100 :   14947371  74.88%
       1000 :      59017   2.45%
      10000 :        112   0.07%
     100000 :          6   0.04%
    1000000 :          0   0.00%
   10000000 :          0   0.00%
  100000000 :          0   0.00%
 1000000000 :          0   0.00%
10000000000 :          0   0.00%
{noformat}
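
To put the numbers above in perspective, here is a small back-of-the-envelope 
calculation.  It assumes that index level 0 holds one entry per data block and 
that the reported level sizes are directly comparable; the class and method 
names are made up for illustration.  Under those assumptions each level 0 
entry averages about 317 bytes, and level 0 accounts for roughly 84% of all 
index bytes.  Each entry also stores block offsets and sizes, so the key 
itself is somewhat smaller than that average.

```java
public class IndexSizeEstimate {

  // Average index bytes per entry, assuming one entry per data block.
  public static double avgBytesPerEntry(long indexBytes, long entries) {
    return (double) indexBytes / entries;
  }

  // Fraction of the total index size taken up by one level.
  public static double share(long levelBytes, long totalBytes) {
    return (double) levelBytes / totalBytes;
  }

  public static void main(String[] args) {
    // Values copied from the rfile-info output for the <DEFAULT> locality group.
    long level0 = 6_915_137L;
    long level1 = 714_687L;
    long level2 = 451_008L;
    long level3 = 120_581L;
    long dataBlocks = 21_818L;

    long total = level0 + level1 + level2 + level3;
    System.out.println(String.format("avg bytes per level 0 entry: %.1f",
        avgBytesPerEntry(level0, dataBlocks)));
    System.out.println(String.format("level 0 share of index: %.1f%%",
        100.0 * share(level0, total)));
  }
}
```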

> optimize index size in RFile
> ----------------------------
>
>                 Key: ACCUMULO-1124
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-1124
>             Project: Accumulo
>          Issue Type: Improvement
>            Reporter: Eric Newton
>            Assignee: Keith Turner
>
> I noticed HBASE-7845 and it seems like something we could do in RFile, too.
> Instead of putting the whole key in the index, you put in enough of the key 
> to get the reader to the beginning of the block.
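
The idea in the description can be sketched roughly as follows.  This is only 
an illustration, not the patch for this ticket and not RFile's actual code: it 
treats a key as a flat byte string compared in unsigned order (real Accumulo 
keys have row, column, and timestamp parts), and the class name, method name, 
and the second example key are hypothetical.  Given the last key of one block 
and the first key of the next, it returns a short key that still sorts between 
them, which is all the index needs to direct a reader to the right block.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class IndexKeyShortener {

  /**
   * Returns a key K such that prev <= K < next in unsigned byte order.
   * K is usually much shorter than prev; it is not guaranteed to be the
   * shortest possible separator.
   */
  public static byte[] shortSeparator(byte[] prev, byte[] next) {
    if (Arrays.compareUnsigned(prev, next) >= 0) {
      throw new IllegalArgumentException("prev must sort before next");
    }
    // Index of the first differing byte, or the shorter length on a prefix match.
    int p = Arrays.mismatch(prev, next);
    if (p == prev.length) {
      // prev is a prefix of next, so nothing shorter sorts at or after prev.
      return prev.clone();
    }
    int a = prev[p] & 0xff;
    int b = next[p] & 0xff;
    if (next.length > p + 1) {
      // Common prefix plus next's differing byte: greater than prev, and a
      // proper prefix of next, so it sorts before next.
      return Arrays.copyOf(next, p + 1);
    }
    if (a + 1 < b) {
      // There is room for a byte value strictly between the differing bytes.
      byte[] k = Arrays.copyOf(prev, p + 1);
      k[p] = (byte) (a + 1);
      return k;
    }
    // No simple shortening applies; fall back to the full key.
    return prev.clone();
  }

  public static void main(String[] args) {
    // Last key of one block and a hypothetical first key of the next block.
    byte[] prev = "com.facebook>.www>s>/dialog/feed".getBytes(StandardCharsets.UTF_8);
    byte[] next = "com.google>.www>o>/".getBytes(StandardCharsets.UTF_8);
    // Prints "com.g": 5 bytes stored in the index instead of 32.
    System.out.println(new String(shortSeparator(prev, next), StandardCharsets.UTF_8));
  }
}
```

The fallback cases return the full key, so the result is never longer than 
what the index stores today.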



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
