[ https://issues.apache.org/jira/browse/ACCUMULO-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15297642#comment-15297642 ]

Josh Elser commented on ACCUMULO-1124:
--------------------------------------
bq. I experimented with shortening keys in the index and that gave some nice
improvements, but not as much as I expected. I realized that even with those
changes, bad keys were still being placed in the index. I added code to keep
statistics on key sizes and used those statistics to try to select keys that
were <=AVG(keySize). I also excluded keys that were too big (greater than 3 std
dev from the mean).

That made me wonder how we would determine, going forward, whether the index
size is efficient (both to evaluate the success of this change and to identify
perf issues later). Did you give any thought to how we could expose this
information more easily? Maybe we could include some extra information in the
file entry in metadata so that the master/monitor could easily aggregate and
report on file statistics? I'm not suggesting it needs to happen now, just
wondering about your thoughts (since I assume you were doing all this
investigation by hand).

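
For readers following along, the selection rule described in the quoted comment (prefer keys no larger than the average key size, exclude keys more than 3 standard deviations from the mean) could look roughly like the sketch below. This is not the code from the patch. The class `KeySizeStats` and its methods are hypothetical names. The sketch also assumes that "too big" means more than 3 standard deviations above the mean, and that the population standard deviation is used.

```java
/**
 * Hypothetical sketch of running key-size statistics; not Accumulo's RFile code.
 * Tracks count, sum, and sum of squares so mean and std dev are cheap to query.
 */
final class KeySizeStats {
    private long count;
    private long sum;
    private long sumOfSquares;

    /** Record the serialized size (in bytes) of one key. */
    void add(int keySize) {
        count++;
        sum += keySize;
        sumOfSquares += (long) keySize * keySize;
    }

    double mean() {
        return count == 0 ? 0.0 : (double) sum / count;
    }

    /** Population standard deviation (an assumption; the comment does not say which). */
    double stdDev() {
        if (count == 0) {
            return 0.0;
        }
        double m = mean();
        // Clamp at zero to guard against tiny negative values from rounding.
        double variance = Math.max(0.0, (double) sumOfSquares / count - m * m);
        return Math.sqrt(variance);
    }

    /** Preferred index candidates are no larger than the average key size. */
    boolean isPreferred(int keySize) {
        return keySize <= mean();
    }

    /** Assumed reading of "too big": more than 3 std dev above the mean. */
    boolean isTooBig(int keySize) {
        return keySize > mean() + 3 * stdDev();
    }
}
```

A writer could call `add` for every key it appends, then consult `isPreferred` and `isTooBig` when deciding whether the current key is an acceptable place to end a block and emit an index entry.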
> optimize index size in RFile
> ----------------------------
>
> Key: ACCUMULO-1124
> URL: https://issues.apache.org/jira/browse/ACCUMULO-1124
> Project: Accumulo
> Issue Type: Improvement
> Reporter: Eric Newton
> Assignee: Keith Turner
> Fix For: 1.8.0
>
> Time Spent: 1h
> Remaining Estimate: 0h
>
> I noticed HBASE-7845 and it seems like something we could do in RFile, too.
> Instead of putting the whole key in the index, you put in enough of the key
> to get the reader to the beginning of the block.
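
As an illustration of the idea in the description, the sketch below computes a shortened index entry from two adjacent keys. It is only a sketch under stated assumptions, not the RFile or HBase implementation. Keys are modeled as raw byte arrays compared as unsigned bytes in lexicographic order, and the shortened entry is taken to be the shortest prefix of the next block's first key that still sorts strictly after the previous block's last key. The class `ShortIndexKey` and the method `shortestSeparator` are hypothetical names.

```java
import java.util.Arrays;

/** Hypothetical sketch of index key shortening; not Accumulo's RFile code. */
final class ShortIndexKey {
    private ShortIndexKey() {}

    /**
     * Returns the shortest prefix s of next such that prev < s <= next,
     * assuming prev < next under unsigned lexicographic byte order.
     * If that precondition does not hold, a full copy of next is returned.
     */
    static byte[] shortestSeparator(byte[] prev, byte[] next) {
        int limit = Math.min(prev.length, next.length);
        int i = 0;
        // Skip the common prefix of the two keys.
        while (i < limit && prev[i] == next[i]) {
            i++;
        }
        if (i == next.length) {
            // next is equal to prev or a prefix of it: inputs are not sorted.
            return next.clone();
        }
        if (i < prev.length && (prev[i] & 0xff) > (next[i] & 0xff)) {
            // prev sorts after next: inputs are not sorted.
            return next.clone();
        }
        // One byte past the common prefix is enough to sort after prev.
        return Arrays.copyOf(next, i + 1);
    }
}
```

For example, with a previous key of `row_0001_aaaaaaaa` and a next key of `row_0002_bbbbbbbb`, the sketch yields the 8-byte entry `row_0002` instead of the full 17-byte key.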