[ https://issues.apache.org/jira/browse/ACCUMULO-4501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15585370#comment-15585370 ]
Keith Turner commented on ACCUMULO-4501:
----------------------------------------

[~elserj] as promised on IRC, here is a write-up. This covers what [~rweeks] and I discussed at the Accumulo Summit Hackathon.

Users could configure, per table, an implementation of CompactionSummarizer.

{code:java}
interface Counters {
  void increment(String counter, long amount);

  void increment(ByteSequence counter, long amount);

  // I thought of use cases where I would want to add a prefix to the counter. We could
  // offer this as a primitive so that each user does not have to figure out how to do this
  // efficiently. Simple examples would be "fam:" and "vis:" prefixes for counting column
  // families and visibilities.
  void increment(String prefix, ByteSequence counter, long amount);
}
{code}

{code:java}
interface CompactionSummarizer {
  void summarize(Key k, Value v, Counters counters);
}
{code}

When a CompactionSummarizer is configured, Accumulo could do the following at compaction time.

 * Compute a histogram during compaction by calling CompactionSummarizer for each Key Value added to the RFile
 * Limit the histogram to a max size
 * Store the histogram in the RFile
 * Store the name of the summarizer in the RFile
 * Store whether the histogram exceeded the max size in the RFile

We could modify rfile-info to print this information when it's present in an RFile. We could also offer a user level API to fetch this information. The API could offer the following.

 * Require the user to specify the name of the CompactionSummarizer they want histograms for. This is so that RFiles containing histograms generated by a different CompactionSummarizer can be ignored.
 * Allow the user to compute a histogram for a row range.
 * Along with the returned histogram, indicate whether histograms were missing from RFiles or exceeded the max size.

We discussed an implementation similar to the BatchScanner in that it would send requests out to TabletServers to fetch info in parallel. Histograms could be combined at the tablet, tablet server, and client.
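To make the proposal above more concrete, here is a minimal sketch of a summarizer that counts column families and visibilities, feeding a size-limited histogram that can also be merged with another one. Everything beyond the Counters and CompactionSummarizer shapes is an assumption for illustration: SimpleKey is a hypothetical stand-in for Accumulo's Key, Strings and byte arrays replace ByteSequence and Value so the example is self-contained, and BoundedHistogram, FamVisSummarizer, and SummarizerSketch are made-up names. How the max size would be enforced was not settled in the discussion; this sketch simply drops counters that would exceed the limit and records that it did so.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Accumulo's Key, reduced to the two fields counted here.
final class SimpleKey {
  final String family;
  final String visibility;

  SimpleKey(String family, String visibility) {
    this.family = family;
    this.visibility = visibility;
  }
}

// The proposed Counters interface, reduced to String-based methods for this sketch.
interface Counters {
  void increment(String counter, long amount);

  void increment(String prefix, String counter, long amount);
}

// The proposed summarizer interface, using the stand-in types above.
interface CompactionSummarizer {
  void summarize(SimpleKey k, byte[] v, Counters counters);
}

// Counts column families and visibilities using the "fam:" and "vis:" prefixes.
final class FamVisSummarizer implements CompactionSummarizer {
  @Override
  public void summarize(SimpleKey k, byte[] v, Counters counters) {
    counters.increment("fam:", k.family, 1);
    counters.increment("vis:", k.visibility, 1);
  }
}

// A histogram limited to maxSize distinct counters (assumed policy: drop new counters).
final class BoundedHistogram implements Counters {
  private final Map<String, Long> counts = new HashMap<>();
  private final int maxSize;
  private boolean exceeded = false;

  BoundedHistogram(int maxSize) {
    this.maxSize = maxSize;
  }

  @Override
  public void increment(String counter, long amount) {
    if (!counts.containsKey(counter) && counts.size() >= maxSize) {
      exceeded = true; // remember that the histogram is incomplete
      return;
    }
    counts.merge(counter, amount, Long::sum);
  }

  @Override
  public void increment(String prefix, String counter, long amount) {
    increment(prefix + counter, amount);
  }

  // Combine another histogram into this one, e.g. at the tablet server or client.
  void mergeFrom(BoundedHistogram other) {
    exceeded |= other.exceeded;
    for (Map.Entry<String, Long> e : other.counts.entrySet()) {
      increment(e.getKey(), e.getValue());
    }
  }

  long get(String counter) {
    return counts.getOrDefault(counter, 0L);
  }

  boolean exceededMaxSize() {
    return exceeded;
  }
}

final class SummarizerSketch {
  public static void main(String[] args) {
    CompactionSummarizer summarizer = new FamVisSummarizer();
    BoundedHistogram perFile = new BoundedHistogram(2);
    summarizer.summarize(new SimpleKey("f1", "public"), new byte[0], perFile);
    summarizer.summarize(new SimpleKey("f2", "secret"), new byte[0], perFile);

    BoundedHistogram combined = new BoundedHistogram(100);
    combined.mergeFrom(perFile);
    System.out.println("fam:f1=" + combined.get("fam:f1"));
    System.out.println("exceeded=" + combined.exceededMaxSize());
  }
}
```

Carrying the exceeded flag through mergeFrom means a combined result still tells the caller that at least one input histogram was truncated, which matches the idea of reporting this alongside the returned histogram.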
Thinking about this a little more after the summit, I realized the BatchScanner-like implementation may double count files that span multiple tablets. Another possible implementation would be to gather the unique set of files in the range, and then farm them out to the tablet servers, aggregating the histograms. This approach makes it hard to cache the serialized histograms.

We also discussed whether the in-memory map should keep a histogram, but came to no conclusion on this.

> Add support to RFile to track and store the histogram
> -----------------------------------------------------
>
>                 Key: ACCUMULO-4501
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4501
>             Project: Accumulo
>          Issue Type: Sub-task
>          Components: client, tserver
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> Modify RFile such that it can build the histogram and store it in an RFile.
> Reading the RFile would deserialize the histogram back into memory.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)