[ 
https://issues.apache.org/jira/browse/ACCUMULO-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15297642#comment-15297642
 ] 

Josh Elser commented on ACCUMULO-1124:
--------------------------------------

bq. I experimented with shortening keys in the index and that gave some nice 
improvements, but not as much as I expected. I realized that even with those 
changes, bad keys were still being placed in the index. I added code to keep 
statistics on key sizes and used those statistics to try to select keys that 
were <=AVG(keySize). I also excluded keys that were too big (greater than 3 std 
dev from the mean).
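
For readers following along, the selection rule quoted above could look roughly like the sketch below. It is only an illustration, not the code from the patch: the class name `KeySizeStats` and its methods are hypothetical, key sizes are assumed to be plain byte counts, and the standard deviation is the population value computed with a running (Welford) update.

```java
// Hypothetical helper for illustration only; not the actual RFile code.
final class KeySizeStats {
    private long count;   // number of key sizes seen so far
    private double mean;  // running mean of key sizes
    private double m2;    // running sum of squared deviations from the mean

    /** Record the size (in bytes) of one key, using Welford's update. */
    void add(int keySize) {
        count++;
        double delta = keySize - mean;
        mean += delta / count;
        m2 += delta * (keySize - mean);
    }

    double mean() {
        return mean;
    }

    /** Population standard deviation of the sizes seen so far. */
    double stdDev() {
        return count > 1 ? Math.sqrt(m2 / count) : 0.0;
    }

    /** Keys at or below the average size are the preferred index candidates. */
    boolean isPreferred(int keySize) {
        return count > 0 && keySize <= mean;
    }

    /** Keys more than 3 standard deviations above the mean are excluded. */
    boolean isTooBig(int keySize) {
        return count > 1 && keySize > mean + 3 * stdDev();
    }
}
```

A writer would call `add` for every key it appends, then consult `isPreferred` and `isTooBig` when deciding whether the current key should end a block and go into the index.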

This made me wonder how we would determine whether the index size is efficient 
in the future, both to evaluate the success of this change and to identify perf 
issues later on. Did you give any thought to how we could expose this 
information more easily? Maybe we could include some extra information in the 
file entry in metadata so that the master/monitor could easily aggregate and 
report on file statistics? I'm not suggesting it needs to happen now, but I'm 
curious about your thoughts (I assume you were doing all of this investigation 
by hand).

> optimize index size in RFile
> ----------------------------
>
>                 Key: ACCUMULO-1124
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-1124
>             Project: Accumulo
>          Issue Type: Improvement
>            Reporter: Eric Newton
>            Assignee: Keith Turner
>             Fix For: 1.8.0
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> I noticed HBASE-7845 and it seems like something we could do in RFile, too.
> Instead of putting the whole key in the index, you put in enough of the key 
> to get the reader to the beginning of the block.
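
The idea in the description can be sketched on flat byte strings as follows. This is a simplified illustration, not RFile's implementation: real Accumulo keys have several fields (row, column family, and so on), while the hypothetical `shortestSeparator` below treats a key as a single byte array and assumes `prev` sorts strictly before `curr`.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not the actual RFile code.
final class IndexKeyShortener {
    /**
     * Returns the shortest prefix of curr that still sorts after prev.
     * prev is the last key of the previous block, curr is the first key
     * of the next block. Assumes prev sorts strictly before curr.
     */
    static byte[] shortestSeparator(byte[] prev, byte[] curr) {
        int limit = Math.min(prev.length, curr.length);
        int i = 0;
        // Skip the common prefix of the two keys.
        while (i < limit && prev[i] == curr[i]) {
            i++;
        }
        // Keep one byte past the common prefix; that byte is what
        // distinguishes curr from prev. Never read past the end of curr.
        int len = Math.min(i + 1, curr.length);
        return Arrays.copyOf(curr, len);
    }
}
```

Because the result is a prefix of `curr` that is greater than `prev`, a reader seeking any key at or after it still lands on the correct block, while the index stores fewer bytes.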



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
