One additional note on size: an uncompressed count-min sketch with eps = 0.1% and confidence 0.87 is 48 KB (48,000 bytes).
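For reference, that figure follows from the standard count-min sizing, assuming (as Spark's implementation does) 8-byte long counters, width = ceil(2 / eps), and depth = ceil(ln(1 / (1 - confidence)) / ln 2):

    val eps = 0.001         // 0.1% relative error bound
    val confidence = 0.87
    val width = math.ceil(2 / eps).toInt                                       // 2000
    val depth = math.ceil(math.log(1 / (1 - confidence)) / math.log(2)).toInt  // 3
    val bytes = depth.toLong * width * 8                                       // 48000 = 48 KB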
To look up what that means, see
http://spark.apache.org/docs/latest/api/java/org/apache/spark/util/sketch/CountMinSketch.html

On Sun, Nov 13, 2016 at 5:30 PM, Reynold Xin <r...@databricks.com> wrote:

> I want to bring this discussion to the dev list to gather broader
> feedback, as there have been some discussions that happened over multiple
> JIRA tickets (SPARK-16026
> <https://issues.apache.org/jira/browse/SPARK-16026>, etc) and GitHub pull
> requests about what statistics to collect and how to use them.
>
> There are some basic statistics on columns that are obvious to use and we
> don't need to debate: estimated size (in bytes), row count, min, max,
> number of nulls, number of distinct values, average column length, and
> max column length.
>
> In addition, we want to be able to estimate selectivity for equality and
> range predicates better, especially taking into account skewed values and
> outliers.
>
> Before I dive into the different options, let me first explain count-min
> sketch. Count-min sketch is a common sketch algorithm that tracks
> frequency counts. It has the following nice properties:
> - sublinear space
> - can be generated in one pass in a streaming fashion
> - can be incrementally maintained (i.e. for appending new data)
> - it's already implemented in Spark
> - more accurate for frequent values and less accurate for less-frequent
> values, i.e. it tracks skewed values well
> - easy to compute the inner product, i.e. it is trivial to compute the
> count-min sketch of a join given the count-min sketches of the two join
> tables
>
> Proposal 1 is to use a combination of count-min sketch and equi-height
> histograms. In this case, the count-min sketch will be used for
> selectivity estimation on equality predicates, and the histogram will be
> used on range predicates.
>
> Proposal 2 is to just use a count-min sketch on equality predicates, and
> then a simple selected_range / (max - min) will be used for range
> predicates. This will be less accurate than using a histogram, but
> simpler because we don't need to collect histograms.
>
> Proposal 3 is a variant of proposal 2 that takes into account that skewed
> values can impact selectivity heavily. In 3, we track the list of heavy
> hitters (HH, the most frequent items) along with a count-min sketch on
> the column. Then:
> - use the count-min sketch on equality predicates
> - for range predicates, estimatedFreq = sum(freq(HHInRange)) + range /
> (max - min)
>
> Proposal 4 is to not use any sketch, and use a histogram for high
> cardinality columns and exact (value, frequency) pairs for low
> cardinality columns (e.g. number of distinct values <= 255).
>
> Proposal 5 is a variant of proposal 4 that adapts it to track exact
> (value, frequency) pairs for the most frequent values only, so we can
> still have those for high cardinality columns. This is actually very
> similar to count-min sketch, but it might use less space, although it
> requires two passes to compute the initial value, and it is harder to
> compute the inner product for joins.
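For concreteness, here is a minimal sketch of how the existing Spark class (per the javadoc above) could back the equality-selectivity estimates discussed in the quote. The toy column values are made up; a real build would stream the column in one pass:

    import org.apache.spark.util.sketch.CountMinSketch

    object CmsSelectivityDemo {
      def main(args: Array[String]): Unit = {
        // Toy stand-in for a column's values; a real build streams these.
        val columnValues = Seq("a", "a", "a", "b", "c")

        val cms = CountMinSketch.create(0.001, 0.87, 42)  // eps, confidence, seed
        columnValues.foreach(cms.add)

        // Equality predicate (col = "a"): estimated count over total count.
        val sel = cms.estimateCount("a").toDouble / cms.totalCount()
        println(f"estimated selectivity of (col = 'a'): $sel%.2f")  // ~0.60

        // Incremental maintenance for appended data: build a sketch of the
        // new rows with the same parameters and merge it in place.
        val delta = CountMinSketch.create(0.001, 0.87, 42)
        Seq("a", "d").foreach(delta.add)
        cms.mergeInPlace(delta)
      }
    }

Merging requires both sketches to share the same dimensions and seed, which is why the delta sketch is created with identical parameters.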
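And a hedged sketch of proposal 3's range estimate. Nothing here is an existing Spark API: `hh` (heavy hitters with exact counts), min/max, and rowCount are stand-ins for the statistics described in the quote, and scaling the uniform term by the non-heavy-hitter row count is one reasonable reading of the formula above:

    // Estimated number of rows with lo <= col <= hi, per proposal 3.
    def estimateRangeFreq(lo: Double, hi: Double,
                          hh: Map[Double, Long],  // heavy hitter -> exact frequency
                          min: Double, max: Double,
                          rowCount: Long): Double = {
      // Exact counts for the skewed values that fall inside the range...
      val hhInRange = hh.collect { case (v, f) if v >= lo && v <= hi => f }.sum
      // ...plus a uniform-distribution estimate for the remaining rows.
      val rest = rowCount - hh.values.sum
      hhInRange + rest * (hi - lo) / (max - min)
    }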