Keith Turner created ACCUMULO-4730:
--------------------------------------

             Summary: Create an Entry length summarizer
                 Key: ACCUMULO-4730
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4730
             Project: Accumulo
          Issue Type: Improvement
            Reporter: Keith Turner
             Fix For: 2.0.0


It would be very useful to have a built-in 
[Summarizer|https://github.com/apache/accumulo/blob/master/core/src/main/java/org/apache/accumulo/core/client/summary/Summarizer.java]
 that computes summary information about field lengths, specifically key 
length, row length, family length, qualifier length, visibility length, and 
value length.  Whatever stats are computed must be computable 
incrementally.  For example, min, max, count, sum, and a log2 histogram can 
all be computed incrementally.  I think these would be good stats to start 
with.  Count and sum can be used to compute the average.  There is an example 
of computing a log2 histogram in the Summarizer javadoc.
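
To illustrate what "incrementally" means here, the following is a minimal sketch of per-field statistics that can be updated one length at a time and merged across partial results.  It does not use the Accumulo API.  The class name LengthStats and its methods are hypothetical, and the bucket rule (floor of log2, with lengths 0 and 1 both placed in bucket 0) is an assumption that may differ from the example in the Summarizer javadoc.

```java
import java.util.TreeMap;

/** Hypothetical helper, not part of Accumulo: running length statistics for one field. */
final class LengthStats {
  long min = Long.MAX_VALUE;
  long max = Long.MIN_VALUE;
  long sum = 0;
  long count = 0;
  // exponent -> number of lengths that fell into that power-of-two bucket
  final TreeMap<Integer, Long> logHist = new TreeMap<>();

  /** Assumed bucket rule: 0 for lengths 0 and 1, otherwise floor(log2(length)). */
  static int bucket(long length) {
    return length < 2 ? 0 : 63 - Long.numberOfLeadingZeros(length);
  }

  /** Fold one observed length into the running statistics. */
  void accept(long length) {
    min = Math.min(min, length);
    max = Math.max(max, length);
    sum += length;
    count++;
    logHist.merge(bucket(length), 1L, Long::sum);
  }

  /** Combine statistics that were computed separately, for example per file. */
  void merge(LengthStats other) {
    min = Math.min(min, other.min);
    max = Math.max(max, other.max);
    sum += other.sum;
    count += other.count;
    other.logHist.forEach((exp, n) -> logHist.merge(exp, n, Long::sum));
  }

  /** The average is derived from sum and count, so it needs no state of its own. */
  double average() {
    return count == 0 ? 0.0 : (double) sum / count;
  }
}
```

Every statistic here merges with a simple associative operation (min, max, or addition), which is why partial results can be combined in any order.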

The Summarizer could be named EntryLengthSummarizer and possibly produce 
summaries like the following.  

{noformat}
count=XXX     //do not need to track this per field, it's the same for all
key.min=XXX
key.max=XXX
key.sum=XXX
key.logHist.8=XXX   //only output nonzero exponents 
key.logHist.9=XXX
row.min=XXX
row.max=XXX
row.sum=XXX
row.logHist.7=XXX
row.logHist.8=XXX
row.logHist.10=XXX
family.min=XXX
family.max=XXX
family.sum=XXX
family.logHist.6=XXX
family.logHist.7=XXX
etc...
{noformat}
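
As a rough sketch of how such keys might be produced, the following standalone Java class tracks the six lengths and emits a map with the layout shown above.  It is independent of the Accumulo Summarizer interface.  The class name EntryLengthSummarySketch, the accept signature, and the bucket rule (floor of log2, with lengths 0 and 1 in bucket 0) are all assumptions made for illustration, and the caller is assumed to supply the key length itself.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch, not Accumulo code: builds summary keys in the proposed layout. */
final class EntryLengthSummarySketch {
  private static final String[] FIELDS =
      {"key", "row", "family", "qualifier", "visibility", "value"};

  private final long[] min = new long[FIELDS.length];
  private final long[] max = new long[FIELDS.length];
  private final long[] sum = new long[FIELDS.length];
  private final long[][] logHist = new long[FIELDS.length][64];
  private long count = 0; // tracked once, it is the same for every field

  EntryLengthSummarySketch() {
    Arrays.fill(min, Long.MAX_VALUE);
  }

  /** Record one entry; lengths must be non-negative and given in FIELDS order. */
  void accept(long... lengths) {
    if (lengths.length != FIELDS.length) {
      throw new IllegalArgumentException("expected " + FIELDS.length + " lengths");
    }
    count++;
    for (int i = 0; i < FIELDS.length; i++) {
      long len = lengths[i];
      min[i] = Math.min(min[i], len);
      max[i] = Math.max(max[i], len);
      sum[i] += len;
      // Assumed bucket rule: floor(log2(len)), with 0 and 1 both in bucket 0.
      int exp = len < 2 ? 0 : 63 - Long.numberOfLeadingZeros(len);
      logHist[i][exp]++;
    }
  }

  /** Emit the statistics as a flat map, skipping histogram buckets that are zero. */
  Map<String, Long> summarize() {
    Map<String, Long> out = new TreeMap<>();
    out.put("count", count);
    if (count == 0) {
      return out; // nothing seen, so min/max/sum would be meaningless
    }
    for (int i = 0; i < FIELDS.length; i++) {
      out.put(FIELDS[i] + ".min", min[i]);
      out.put(FIELDS[i] + ".max", max[i]);
      out.put(FIELDS[i] + ".sum", sum[i]);
      for (int exp = 0; exp < 64; exp++) {
        if (logHist[i][exp] != 0) {
          out.put(FIELDS[i] + ".logHist." + exp, logHist[i][exp]);
        }
      }
    }
    return out;
  }
}
```

A real implementation would plug equivalent logic into the Summarizer's collector and combiner rather than use a standalone class like this one.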

This new summarizer would be placed in the 
[summarizers|https://github.com/apache/accumulo/tree/master/core/src/main/java/org/apache/accumulo/core/client/summary/summarizers]
 package.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
