[ https://issues.apache.org/jira/browse/ACCUMULO-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Keith Turner resolved ACCUMULO-4730.
------------------------------------
    Resolution: Fixed

> Create an Entry length summarizer
> ---------------------------------
>
>                 Key: ACCUMULO-4730
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4730
>             Project: Accumulo
>          Issue Type: Improvement
>            Reporter: Keith Turner
>            Assignee: Jared R
>              Labels: newbie, pull-request-available
>             Fix For: 2.0.0
>
>          Time Spent: 3h
>  Remaining Estimate: 0h
>
> It would be very useful to have a built-in
> [Summarizer|https://github.com/apache/accumulo/blob/master/core/src/main/java/org/apache/accumulo/core/client/summary/Summarizer.java]
> that computes summary information about field lengths, specifically key
> length, row length, family length, qualifier length, visibility length, and
> value length. Whatever stats are computed must be computable
> incrementally. For example, min, max, count, sum, and a log2 histogram can
> all be computed incrementally. I think these would be good stats to start
> with. Count and sum can be used to compute the average. There is an example
> of computing a log2 histogram in the Summarizer javadoc.
> The Summarizer could be named EntryLengthSummarizer and possibly produce
> summaries like the following.
> {noformat}
> count=XXX //do not need to track this per field, it's the same for all
> key.min=XXX
> key.max=XXX
> key.sum=XXX
> key.logHist.8=XXX //only output non-zero exponents
> key.logHist.9=XXX
> row.min=XXX
> row.max=XXX
> row.sum=XXX
> row.logHist.7=XXX
> row.logHist.8=XXX
> row.logHist.10=XXX
> family.min=XXX
> family.max=XXX
> family.sum=XXX
> family.logHist.6=XXX
> family.logHist.7=XXX
> etc...
> {noformat}
> This new summarizer would be placed in the
> [summarizers|https://github.com/apache/accumulo/tree/master/core/src/main/java/org/apache/accumulo/core/client/summary/summarizers]
> package.
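
For illustration only, the incremental statistics described in the issue (min, max, sum, and a log2 histogram per field) could be tracked roughly as in the sketch below. This is not the code that was committed for ACCUMULO-4730 and it does not use the Accumulo Summarizer interface. The class name LengthStats, its methods, and the choice of floor(log2(length)) as the histogram bucket are all assumptions made for the example.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper (not an Accumulo class): length statistics for one field. */
final class LengthStats {
    private boolean seen = false;          // true once at least one length was observed
    private long min = Long.MAX_VALUE;
    private long max = 0;
    private long sum = 0;
    private final long[] logHist = new long[32]; // one slot per possible exponent of an int

    /** Folds a single field length into the running statistics. */
    void accept(int length) {
        seen = true;
        min = Math.min(min, length);
        max = Math.max(max, length);
        sum += length;
        logHist[bucket(length)]++;
    }

    /** Assumed bucketing: floor(log2(length)); a length of zero goes to bucket 0. */
    static int bucket(int length) {
        return length <= 0 ? 0 : 31 - Integer.numberOfLeadingZeros(length);
    }

    /** Combines statistics gathered separately, which is what makes them incremental. */
    void merge(LengthStats other) {
        seen |= other.seen;
        min = Math.min(min, other.min);
        max = Math.max(max, other.max);
        sum += other.sum;
        for (int i = 0; i < logHist.length; i++) {
            logHist[i] += other.logHist[i];
        }
    }

    /** Writes prefix.min, prefix.max, prefix.sum and the non-zero prefix.logHist.N entries. */
    void emit(String prefix, Map<String, Long> out) {
        if (!seen) {
            return; // nothing observed, so there is nothing meaningful to report
        }
        out.put(prefix + ".min", min);
        out.put(prefix + ".max", max);
        out.put(prefix + ".sum", sum);
        for (int i = 0; i < logHist.length; i++) {
            if (logHist[i] > 0) {
                out.put(prefix + ".logHist." + i, logHist[i]);
            }
        }
    }

    public static void main(String[] args) {
        LengthStats row = new LengthStats();
        for (int len : new int[] {3, 300, 1}) {
            row.accept(len);
        }
        Map<String, Long> summary = new TreeMap<>();
        summary.put("count", 3L); // tracked once, shared by all fields
        row.emit("row", summary);
        summary.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

In this sketch one LengthStats instance would be kept per field (key, row, family, qualifier, visibility, value), with a single shared count, so the average for a field is its sum divided by that count. Because merge only uses min, max, and addition, partial results can be combined in any order.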