One additional note on size: uncompressed, a count-min sketch with
eps = 0.1% and confidence 0.87 takes 48k bytes.

For what eps and confidence mean here, see
http://spark.apache.org/docs/latest/api/java/org/apache/spark/util/sketch/CountMinSketch.html
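
For a quick sanity check of that number: if I read Spark's implementation
correctly, it sizes the table as width = ceil(2 / eps) and
depth = ceil(-log2(1 - confidence)), with one 8-byte long per counter:

    import org.apache.spark.util.sketch.CountMinSketch

    // eps = 0.1%, confidence = 0.87; the seed (42) is arbitrary
    val cms = CountMinSketch.create(0.001, 0.87, 42)
    cms.width()               // ceil(2 / 0.001)       = 2000 counters per row
    cms.depth()               // ceil(-log2(1 - 0.87)) = 3 rows
    // 3 rows * 2000 counters * 8 bytes = 48,000 bytes, i.e. the 48k above
    cms.toByteArray().length  // ~48k, plus a small header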

On Sun, Nov 13, 2016 at 5:30 PM, Reynold Xin <r...@databricks.com> wrote:

> I want to bring this discussion to the dev list to gather broader
> feedback, as the discussion so far has been scattered across multiple JIRA
> tickets (SPARK-16026
> <https://issues.apache.org/jira/browse/SPARK-16026>, etc.) and GitHub pull
> requests about what statistics to collect and how to use them.
>
> There are some basic statistics on columns that are obviously useful, and
> we don't need to debate these: estimated size (in bytes), row count, min,
> max, number of nulls, number of distinct values, average column length,
> max column length.
>
> In addition, we want to be able to estimate selectivity for equality and
> range predicates better, especially taking into account skewed values and
> outliers.
>
> Before I dive into the different options, let me first explain count-min
> sketch: Count-min sketch is a common sketch algorithm that tracks frequency
> counts. It has the following nice properties:
> - sublinear space
> - can be generated in one pass in a streaming fashion
> - can be incrementally maintained (i.e. for appending new data)
> - it's already implemented in Spark
> - more accurate for frequent values, and less accurate for less-frequent
> values, i.e. it tracks skewed values well.
> - easy to compute inner products, i.e. it is trivial to estimate the
> cardinality of a join given the count-min sketches of the two join tables
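>
> To make the equality case concrete, here is a minimal sketch against the
> existing CountMinSketch API (the column sample and predicate value are
> made up for illustration):
>
>     import org.apache.spark.util.sketch.CountMinSketch
>
>     // one streaming pass over a (hypothetical) column
>     val columnValues: Seq[Any] = Seq("US", "US", "CN", "US", "DE")
>     val cms = CountMinSketch.create(0.001, 0.87, 42)  // eps, confidence, seed
>     columnValues.foreach(v => cms.add(v))
>
>     // selectivity of the equality predicate (col = "US")
>     val sel = cms.estimateCount("US").toDouble / cms.totalCount()
>
>     // incremental maintenance: fold in a sketch built over appended data
>     // cms.mergeInPlace(sketchOfNewPartition)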
>
>
> Proposal 1 is to use a combination of count-min sketch and equi-height
> histograms. In this case, count-min sketch will be used for selectivity
> estimation on equality predicates, and histogram will be used on range
> predicates.
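>
> As a rough sketch of the histogram half of this (all names hypothetical):
> with equi-height buckets each bucket holds ~1/n of the rows, so a range
> predicate just sums the overlapped fraction of each bucket:
>
>     // bounds(0) = min, bounds(n) = max; bucket i covers [bounds(i), bounds(i+1))
>     def rangeSelectivity(bounds: Array[Double], lo: Double, hi: Double): Double = {
>       val n = bounds.length - 1
>       (0 until n).map { i =>
>         val (bLo, bHi) = (bounds(i), bounds(i + 1))
>         val overlap = math.max(0.0, math.min(hi, bHi) - math.max(lo, bLo))
>         if (bHi > bLo) (1.0 / n) * overlap / (bHi - bLo) else 0.0
>       }.sum
>     }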
>
> Proposal 2 is to just use count-min sketch on equality predicates, and a
> simple selected_range / (max - min) estimate for range predicates.
> This will be less accurate than using histogram, but simpler because we
> don't need to collect histograms.
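>
> In code form (a hypothetical helper, assuming numeric min/max stats are
> available):
>
>     // proposal 2's range estimate: assume values are uniform on [min, max]
>     def uniformRangeSelectivity(min: Double, max: Double,
>                                 lo: Double, hi: Double): Double =
>       if (max > min)
>         math.max(0.0, math.min(hi, max) - math.max(lo, min)) / (max - min)
>       else 0.0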
>
> Proposal 3 is a variant of proposal 2, and takes into account that skewed
> values can impact selectivity heavily. In 3, we track the list of heavy
> hitters (HH, most frequent items) along with count-min sketch on the
> column. Then:
> - use count-min sketch on equality predicates
> - for range predicates, estimatedFreq = sum(freq(HHInRange)) + range /
> (max - min)
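>
> In code form (a hypothetical helper; freq here is a relative frequency,
> and I scaled the uniform part by the non-heavy-hitter mass, which I assume
> is the intent, so the heavy hitters are not counted twice):
>
>     def hhRangeSelectivity(
>         hh: Map[Double, Double],  // heavy hitter value -> relative frequency
>         min: Double, max: Double,  // assumes max > min
>         lo: Double, hi: Double): Double = {
>       val hhInRange = hh.collect { case (v, f) if v >= lo && v <= hi => f }.sum
>       val overlap = math.max(0.0, math.min(hi, max) - math.max(lo, min))
>       hhInRange + (1.0 - hh.values.sum) * overlap / (max - min)
>     }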
>
> Proposal 4 is to not use any sketch, and use histogram for high
> cardinality columns, and exact (value, frequency) pairs for low cardinality
> columns (e.g. num distinct values <= 255).
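>
> For the low cardinality path, the exact pairs make equality selectivity an
> exact lookup (a tiny sketch, names hypothetical):
>
>     val columnValues: Seq[Any] = Seq("US", "US", "CN", "DE")
>     val exactFreqs: Map[Any, Long] =
>       columnValues.groupBy(identity).map { case (v, vs) => v -> vs.size.toLong }
>     val eqSel = exactFreqs.getOrElse("US", 0L).toDouble / columnValues.size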
>
> Proposal 5 is a variant of proposal 4, and adapts it to track exact
> (value, frequency) pairs for the most frequent values only, so high
> cardinality columns can still have exact counts for their hot values. This
> is actually very similar to count-min sketch, but might use less space; on
> the other hand, it requires two passes to compute initially, and it is
> more difficult to compute the inner product for joins.
>
