Keith Turner created ACCUMULO-4730:
--------------------------------------
Summary: Create an Entry length summarizer
Key: ACCUMULO-4730
URL: https://issues.apache.org/jira/browse/ACCUMULO-4730
Project: Accumulo
Issue Type: Improvement
Reporter: Keith Turner
Fix For: 2.0.0
It would be very useful to have a built-in
[Summarizer|https://github.com/apache/accumulo/blob/master/core/src/main/java/org/apache/accumulo/core/client/summary/Summarizer.java]
that computes summary information about field lengths, specifically key
length, row length, family length, qualifier length, visibility length, and
value length. Whatever stats are computed must be computable
incrementally. For example, min, max, count, sum, and a log2 histogram can
all be computed incrementally. I think these would be good stats to start
with. Count and sum can be used to compute the average. There is an example
of computing a log2 histogram in the Summarizer javadoc.
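As a rough sketch of what "incrementally" could mean here, the following plain Java class tracks min, max, sum, and a log2 histogram for one field, one length at a time. It does not use the Accumulo API. The class name FieldLengthStats, its methods, and the choice of floor(log2(length)) as the bucket exponent (with length 0 placed in bucket 0) are assumptions for illustration and may differ from the example in the Summarizer javadoc.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Accumulo: length statistics for a single field.
class FieldLengthStats {
  long min = Long.MAX_VALUE; // stays at MAX_VALUE until the first length is seen
  long max = Long.MIN_VALUE; // stays at MIN_VALUE until the first length is seen
  long sum = 0;
  // logHist[i] counts lengths with floor(log2(length)) == i.
  // Assumption: lengths 0 and 1 both land in bucket 0.
  final long[] logHist = new long[32];

  // Folds one observed length into the running statistics.
  void accept(int length) {
    min = Math.min(min, length);
    max = Math.max(max, length);
    sum += length;
    int exponent = length <= 1 ? 0 : 31 - Integer.numberOfLeadingZeros(length);
    logHist[exponent]++;
  }

  // Writes the statistics under keys such as "row.min" and "row.logHist.8",
  // emitting only histogram buckets with a non-zero count.
  void writeTo(String prefix, Map<String, Long> out) {
    out.put(prefix + ".min", min);
    out.put(prefix + ".max", max);
    out.put(prefix + ".sum", sum);
    for (int i = 0; i < logHist.length; i++) {
      if (logHist[i] > 0) {
        out.put(prefix + ".logHist." + i, logHist[i]);
      }
    }
  }
}
```

One instance per field (key, row, family, qualifier, visibility, value) plus a single shared count would be enough to produce the statistics listed above.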
The Summarizer could be named EntryLengthSummarizer and possibly produce
summaries like the following.
{noformat}
count=XXX //do not need to track this per field, it's the same for all
key.min=XXX
key.max=XXX
key.sum=XXX
key.logHist.8=XXX //only output non zero exponents
key.logHist.9=XXX
row.min=XXX
row.max=XXX
row.sum=XXX
row.logHist.7=XXX
row.logHist.8=XXX
row.logHist.10=XXX
family.min=XXX
family.max=XXX
family.sum=XXX
family.logHist.6=XXX
family.logHist.7=XXX
etc...
{noformat}
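Summaries in this form also need to be mergeable, since each one only covers part of the data. A minimal sketch of such a merge, assuming the key naming shown above: keys ending in .min take the smaller value, keys ending in .max take the larger, and everything else (count, sums, and histogram buckets) is added. The class SummaryMerge and the method mergeInto are hypothetical names, not part of Accumulo.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Accumulo: merges one summary map into another.
class SummaryMerge {
  // Folds the statistics in 'other' into 'target', choosing the merge rule by key suffix.
  static void mergeInto(Map<String, Long> target, Map<String, Long> other) {
    for (Map.Entry<String, Long> e : other.entrySet()) {
      String key = e.getKey();
      if (key.endsWith(".min")) {
        target.merge(key, e.getValue(), Long::min);
      } else if (key.endsWith(".max")) {
        target.merge(key, e.getValue(), Long::max);
      } else {
        // count, *.sum and *.logHist.N are all plain sums
        target.merge(key, e.getValue(), Long::sum);
      }
    }
  }
}
```

A histogram bucket that appears in only one of the two maps is carried over unchanged, which fits the convention of outputting only non-zero exponents.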
This new summarizer would be placed in the
[summarizers|https://github.com/apache/accumulo/tree/master/core/src/main/java/org/apache/accumulo/core/client/summary/summarizers]
package.