[ 
https://issues.apache.org/jira/browse/ACCUMULO-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15298561#comment-15298561
 ] 

Keith Turner commented on ACCUMULO-1124:
----------------------------------------

I pushed a commit to the PR that adds a {{--keyStats}} option to rfile-info.  
Below is the output of running this command on the original file.  It shows that 
the 6 largest keys all ended up in the index.  Also, the average key size in the 
index is over twice that of the data.  

{noformat}
$ accumulo rfile-info --keyStats ~/A000rxoi.rf 
Reading file: file:/home/fluo/A000rxoi.rf
RFile Version            : 7

Locality group           : notify
        Start block            : 0
        Num   blocks           : 0
        Index level 0          : 0 bytes  1 blocks
        First key              : null
        Last key               : null
        Num entries            : 0
        Column families        : [ntfy]
Locality group           : <DEFAULT>
        Start block            : 0
        Num   blocks           : 21,818
        Index level 3          : 120,581 bytes  1 blocks
        Index level 2          : 451,008 bytes  2 blocks
        Index level 1          : 714,687 bytes  3 blocks
        Index level 0          : 6,915,137 bytes  25 blocks
        First key              : um:d:385:%03;%01;10.30.170.244>>o>/2954%af; data:current [] 4611686019157309597 false
        Last key               : um:d:395:%03;%01;%ff; com.facebook>.www>s>/dialog/feed?app_id=90376669494... TRUNCATED data:current [] -6917529026891043602 false
        Num entries            : 24,299,468
        Column families        : [data]

Meta block     : BCFile.index
      Raw size             : 4 bytes
      Compressed size      : 12 bytes
      Compression type     : gz

Meta block     : RFile.index
      Raw size             : 120,754 bytes
      Compressed size      : 21,719 bytes
      Compression type     : gz


Statistics for keys in data :
        Up to size      count      %-age
                 10 :   10768926  26.51%
                100 :   13471699  70.82%
               1000 :      58725   2.56%
              10000 :        112   0.07%
             100000 :          6   0.04%
            1000000 :          0   0.00%
           10000000 :          0   0.00%
          100000000 :          0   0.00%
         1000000000 :          0   0.00%
        10000000000 :          0   0.00%

        min:      31.00 max: 330,380.00 avg:     122.99 stddev:     157.51

Statistics for keys in index :
        Up to size      count      %-age
                 10 :       6192   7.67%
                100 :      15024  49.96%
               1000 :        578  13.21%
              10000 :         18   8.83%
             100000 :          6  20.33%
            1000000 :          0   0.00%
           10000000 :          0   0.00%
          100000000 :          0   0.00%
         1000000000 :          0   0.00%
        10000000000 :          0   0.00%

        min:      36.00 max: 330,380.00 avg:     281.73 stddev:   3,901.56
$
{noformat}
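
As a quick check of the "over twice" observation, the two avg values reported above can be compared directly. This is only a small arithmetic sketch in Python; the variable names are illustrative and the numbers are copied from the output above.

```python
# Average key sizes copied from the rfile-info --keyStats output above.
avg_data_key = 122.99    # "Statistics for keys in data", avg
avg_index_key = 281.73   # "Statistics for keys in index", avg

# Ratio of the average index key size to the average data key size.
ratio = avg_index_key / avg_data_key
print(f"index avg / data avg = {ratio:.2f}")  # prints 2.29
```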

Below is the output of running this command on a file compacted using the code 
in the PR.  None of the largest keys are in the index, and the average key size 
in the index is less than half of that in the data.

{noformat}
$ accumulo rfile-info --keyStats /accumulo/tables/2/default_tablet/A0000005.rf
Reading file: hdfs://localhost:10000/accumulo/tables/2/default_tablet/A0000005.rf
RFile Version            : 8

Locality group           : <DEFAULT>
        Num   blocks           : 21,758
        Index level 1          : 3,048 bytes  1 blocks
        Index level 0          : 1,873,885 bytes  8 blocks
        First key              : um:d:385:%03;%01;10.30.170.244>>o>/2954%af; data:current [] 4611686019157309597 false
        Last key               : um:d:395:%03;%01;%ff; com.facebook>.www>s>/dialog/feed?app_id=90376669494... TRUNCATED data:current [] -6917529026891043602 false
        Num entries            : 24,299,468
        Column families        : [data]

Meta block     : BCFile.index
      Raw size             : 4 bytes
      Compressed size      : 12 bytes
      Compression type     : gz

Meta block     : RFile.index
      Raw size             : 3,163 bytes
      Compressed size      : 1,515 bytes
      Compression type     : gz


Statistics for keys in data :
        Up to size      count      %-age
                 10 :   10768926  26.51%
                100 :   13471699  70.82%
               1000 :      58725   2.56%
              10000 :        112   0.07%
             100000 :          6   0.04%
            1000000 :          0   0.00%
           10000000 :          0   0.00%
          100000000 :          0   0.00%
         1000000000 :          0   0.00%
        10000000000 :          0   0.00%

        min:      31.00 max: 330,380.00 avg:     122.99 stddev:     157.51

Statistics for keys in index :
        Up to size      count      %-age
                 10 :      18153  68.40%
                100 :       3602  31.43%
               1000 :          1   0.17%
              10000 :          0   0.00%
             100000 :          0   0.00%
            1000000 :          0   0.00%
           10000000 :          0   0.00%
          100000000 :          0   0.00%
         1000000000 :          0   0.00%
        10000000000 :          0   0.00%

        min:       9.00 max:   2,134.00 avg:      58.49 stddev:      36.23
$
{noformat}
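
The idea in the issue description (store only enough of a key to direct a reader to the right block) can be illustrated with a generic shortest-separator sketch. This is not the code in the PR and makes no claim about how the PR chooses index keys. It assumes keys are plain byte strings compared lexicographically, and the function name {{shortest_separator}} and the sample keys are hypothetical.

```python
def shortest_separator(prev_last: bytes, next_first: bytes) -> bytes:
    """Return a short key s such that prev_last < s <= next_first.

    prev_last is the last key of one block and next_first is the first
    key of the following block (byte-wise lexicographic order assumed).
    """
    if not prev_last < next_first:
        raise ValueError("expected prev_last < next_first")

    # Length of the common prefix of the two keys.
    n = 0
    limit = min(len(prev_last), len(next_first))
    while n < limit and prev_last[n] == next_first[n]:
        n += 1

    # One byte past the common prefix is enough to sort after prev_last,
    # and a prefix of next_first never sorts after next_first.
    return next_first[: n + 1]


# Hypothetical keys: a short last key followed by a much longer first key.
prev_last = b"com.example/page/0001"
next_first = b"com.example/page/0450?session=abcdef"
sep = shortest_separator(prev_last, next_first)
print(len(next_first), len(sep), sep)
```

With a separator like this, the size of an index entry depends on how much two neighbouring keys have in common rather than on the full length of either key, so a single very large key next to a block boundary does not have to be copied into the index.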

> optimize index size in RFile
> ----------------------------
>
>                 Key: ACCUMULO-1124
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-1124
>             Project: Accumulo
>          Issue Type: Improvement
>            Reporter: Eric Newton
>            Assignee: Keith Turner
>             Fix For: 1.8.0
>
>          Time Spent: 2h
>  Remaining Estimate: 0h
>
> I noticed HBASE-7845 and it seems like something we could do in RFile, too.
> Instead of putting the whole key in the index, you put in enough of the key 
> to get the reader to the beginning of the block.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
