Keith Turner commented on ACCUMULO-4501:

[~elserj] as promised on IRC, here is a write up.  This covers what [~rweeks] 
and I discussed at the Accumulo Summit Hackathon.

Users could configure, per table, an implementation of CompactionSummarizer.  

  interface Counters {
    void increment(String counter, long amount);
    void increment(ByteSequence counter, long amount);

    // I thought of use cases where I would want to add a prefix to the
    // counter.  We could offer this as a primitive so that each user does
    // not have to figure out how to do this efficiently.  Simple examples
    // would be "fam:" and "vis:" prefixes for counting column families and
    // visibilities.
    void increment(String prefix, ByteSequence counter, long amount);
  }

  interface CompactionSummarizer {
    void summarize(Key k, Value v, Counters counters);
  }
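
To make the "fam:" and "vis:" use case concrete, here is a rough sketch of what a summarizer could look like. It does not use the real Accumulo classes: SketchKey, SketchCounters, SketchSummarizer, MapCounters and FamVisSummarizer are all hypothetical stand-ins, with byte[] in place of ByteSequence and Value, so that the example compiles on its own.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

// Stand-in for Accumulo's Key, reduced to the two fields this sketch reads.
final class SketchKey {
  final byte[] family;
  final byte[] visibility;

  SketchKey(String family, String visibility) {
    this.family = family.getBytes(StandardCharsets.UTF_8);
    this.visibility = visibility.getBytes(StandardCharsets.UTF_8);
  }
}

// Reduced form of the proposed Counters interface (byte[] instead of ByteSequence).
interface SketchCounters {
  void increment(String counter, long amount);

  void increment(String prefix, byte[] counter, long amount);
}

// Reduced form of the proposed CompactionSummarizer (byte[] instead of Value).
interface SketchSummarizer {
  void summarize(SketchKey k, byte[] v, SketchCounters counters);
}

// Map-backed counters. Building a String per call is simple, not efficient.
final class MapCounters implements SketchCounters {
  final Map<String, Long> counts = new TreeMap<>();

  @Override
  public void increment(String counter, long amount) {
    counts.merge(counter, amount, Long::sum);
  }

  @Override
  public void increment(String prefix, byte[] counter, long amount) {
    increment(prefix + new String(counter, StandardCharsets.UTF_8), amount);
  }
}

// Counts entries per column family and per visibility label.
final class FamVisSummarizer implements SketchSummarizer {
  @Override
  public void summarize(SketchKey k, byte[] v, SketchCounters counters) {
    counters.increment("fam:", k.family, 1);
    counters.increment("vis:", k.visibility, 1);
  }
}

final class FamVisDemo {
  public static void main(String[] args) {
    MapCounters counters = new MapCounters();
    FamVisSummarizer summarizer = new FamVisSummarizer();
    summarizer.summarize(new SketchKey("f1", "A"), new byte[0], counters);
    summarizer.summarize(new SketchKey("f2", "A"), new byte[0], counters);
    System.out.println(counters.counts);
  }
}
```

The prefixed increment keeps the two kinds of counters apart in a single histogram, which is the reason for offering it as a primitive.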

When a CompactionSummarizer is configured, Accumulo could do the following at 
compaction time.

 * Compute a histogram during compaction by calling CompactionSummarizer for 
each Key Value added to RFile
 * Limit the histogram to a max size
 * Store histogram in RFile
 * Store name of summarizer in RFile
 * Store whether the histogram exceeded the max size in RFile
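
The size limit and the "exceeded" flag from the list above could be sketched roughly as follows. BoundedHistogram is a hypothetical name, and the policy of dropping new counters once the limit is reached is only an assumption; we did not settle on what should happen when the limit is hit.

```java
import java.util.HashMap;
import java.util.Map;

// Histogram that refuses to grow past maxSize distinct counters.
final class BoundedHistogram {
  private final Map<String, Long> counts = new HashMap<>();
  private final int maxSize;
  private boolean exceeded = false;

  BoundedHistogram(int maxSize) {
    this.maxSize = maxSize;
  }

  void increment(String counter, long amount) {
    if (!counts.containsKey(counter) && counts.size() >= maxSize) {
      // Assumed policy: drop the new counter and remember that data was lost.
      exceeded = true;
      return;
    }
    counts.merge(counter, amount, Long::sum);
  }

  // Returns 0 for counters that were never seen or were dropped.
  long get(String counter) {
    return counts.getOrDefault(counter, 0L);
  }

  int size() {
    return counts.size();
  }

  // This is the flag that would be stored in the RFile next to the histogram.
  boolean exceededMaxSize() {
    return exceeded;
  }
}

final class BoundedHistogramDemo {
  public static void main(String[] args) {
    BoundedHistogram h = new BoundedHistogram(2);
    h.increment("fam:a", 1);
    h.increment("fam:b", 1);
    h.increment("fam:c", 1); // dropped, sets the flag
    System.out.println(h.size() + " counters, exceeded=" + h.exceededMaxSize());
  }
}
```

Counters that already exist keep accumulating after the limit is hit, so the stored histogram is accurate for what it contains but incomplete, which the flag tells the reader.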

We could modify rfile-info to print this information when it's present in an 
RFile.  We could also offer a user level API to fetch this information. The API 
could offer the following.

 * Require user to specify the name of the CompactionSummarizer they want 
histograms for.  This is so that RFiles containing histograms generated by a 
different CompactionSummarizer can be ignored.
 * Allow user to compute histogram for a row range.
 * Along with returned histogram, indicate if histograms were missing from 
RFiles or exceeded max size.
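
A rough sketch of how per-file histograms might be merged for such an API is below. FileSummary, MergedSummary and SummaryMerger are hypothetical names, not existing Accumulo classes. The sketch assumes that a file written by a different summarizer is reported the same way as a file with no histogram, and it leaves out the row range entirely.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// What one RFile is assumed to store: summarizer name, histogram, exceeded flag.
final class FileSummary {
  final String summarizerName; // null when the file has no histogram
  final Map<String, Long> counts;
  final boolean exceededMaxSize;

  FileSummary(String summarizerName, Map<String, Long> counts, boolean exceededMaxSize) {
    this.summarizerName = summarizerName;
    this.counts = counts;
    this.exceededMaxSize = exceededMaxSize;
  }
}

// Result handed back to the user: merged counts plus completeness indicators.
final class MergedSummary {
  final Map<String, Long> counts = new TreeMap<>();
  int filesMissing = 0;
  int filesExceeded = 0;
}

final class SummaryMerger {
  static MergedSummary merge(String wantedSummarizer, List<FileSummary> files) {
    MergedSummary out = new MergedSummary();
    for (FileSummary f : files) {
      if (!wantedSummarizer.equals(f.summarizerName)) {
        // No histogram, or one from a different summarizer: ignore but report it.
        out.filesMissing++;
        continue;
      }
      if (f.exceededMaxSize) {
        out.filesExceeded++;
      }
      f.counts.forEach((k, v) -> out.counts.merge(k, v, Long::sum));
    }
    return out;
  }
}

final class SummaryMergerDemo {
  public static void main(String[] args) {
    List<FileSummary> files = List.of(
        new FileSummary("FamVis", Map.of("fam:a", 3L), false),
        new FileSummary("Other", Map.of("x", 9L), false),
        new FileSummary(null, Map.of(), false));
    MergedSummary m = SummaryMerger.merge("FamVis", files);
    System.out.println(m.counts + " missing=" + m.filesMissing);
  }
}
```

Returning counts of missing and exceeded files, instead of a single boolean, is just one option; it lets the caller judge how incomplete the merged histogram is.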

We discussed an implementation similar to the BatchScanner, in that it would 
send requests out to TabletServers to fetch info in parallel.  Histograms could 
be combined at the tablet, tablet server, and client.  Thinking about this a 
little more after the summit, I realized this implementation may double count 
files that span multiple tablets.  Another possible implementation would be to 
gather the unique set of files in the range, and then farm the aggregation of 
their histograms out to the tablet servers.  This approach makes it harder to 
cache the serialized histograms.  We also discussed whether the in-memory map 
should keep a histogram, but came to no conclusion on this.
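
The double counting concern can be illustrated with a small sketch. FileHistograms and its methods are hypothetical, and the sketch assumes each tablet would contribute the whole-file histogram for every file it references. Everything runs in one process here; the parallel farming out to tablet servers is not modeled.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class FileHistograms {
  // Adds every counter of src into dst.
  static void addInto(Map<String, Long> dst, Map<String, Long> src) {
    src.forEach((k, v) -> dst.merge(k, v, Long::sum));
  }

  // Per-tablet combination: a file referenced by two tablets is added twice.
  static Map<String, Long> combinePerTablet(Map<String, List<String>> tabletFiles,
      Map<String, Map<String, Long>> fileHistograms) {
    Map<String, Long> total = new HashMap<>();
    for (List<String> files : tabletFiles.values()) {
      for (String file : files) {
        addInto(total, fileHistograms.get(file));
      }
    }
    return total;
  }

  // Unique-file combination: gather the distinct files first, then add each once.
  static Map<String, Long> combineUniqueFiles(Map<String, List<String>> tabletFiles,
      Map<String, Map<String, Long>> fileHistograms) {
    Set<String> unique = new LinkedHashSet<>();
    for (List<String> files : tabletFiles.values()) {
      unique.addAll(files);
    }
    Map<String, Long> total = new HashMap<>();
    for (String file : unique) {
      addInto(total, fileHistograms.get(file));
    }
    return total;
  }
}

final class FileHistogramsDemo {
  public static void main(String[] args) {
    // S.rf is shared by both tablets, as it could be after a split.
    Map<String, List<String>> tabletFiles =
        Map.of("t1", List.of("A.rf", "S.rf"), "t2", List.of("S.rf"));
    Map<String, Map<String, Long>> hist =
        Map.of("A.rf", Map.of("fam:x", 5L), "S.rf", Map.of("fam:x", 10L));
    System.out.println(FileHistograms.combinePerTablet(tabletFiles, hist));
    System.out.println(FileHistograms.combineUniqueFiles(tabletFiles, hist));
  }
}
```

Neither variant handles a row range that covers only part of a shared file, so that case would still need separate treatment.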

> Add support to RFile to track and store the histogram
> -----------------------------------------------------
>                 Key: ACCUMULO-4501
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4501
>             Project: Accumulo
>          Issue Type: Sub-task
>          Components: client, tserver
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>          Time Spent: 1h
>  Remaining Estimate: 0h
> Modify RFile such that it can build the histogram and store it in an RFile.
> Reading the RFile would deserialize the histogram back into memory.

This message was sent by Atlassian JIRA
