Do we have any query workloads for which we can benchmark these proposals in terms of performance?
Thanks
Shivaram

On Sun, Nov 13, 2016 at 5:53 PM, Reynold Xin <r...@databricks.com> wrote:
> One additional note: in terms of size, a count-min sketch with
> eps = 0.1% and confidence 0.87, uncompressed, is 48k bytes.
>
> To look up what that means, see
> http://spark.apache.org/docs/latest/api/java/org/apache/spark/util/sketch/CountMinSketch.html
>
> On Sun, Nov 13, 2016 at 5:30 PM, Reynold Xin <r...@databricks.com> wrote:
>>
>> I want to bring this discussion to the dev list to gather broader
>> feedback, as there have been discussions spread across multiple JIRA
>> tickets (SPARK-16026, etc.) and GitHub pull requests about what
>> statistics to collect and how to use them.
>>
>> There are some basic column statistics that are obviously useful, and
>> we don't need to debate these: estimated size (in bytes), row count,
>> min, max, number of nulls, number of distinct values, average column
>> length, and max column length.
>>
>> In addition, we want to estimate selectivity for equality and range
>> predicates more accurately, especially in the presence of skewed
>> values and outliers.
>>
>> Before I dive into the different options, let me first explain
>> count-min sketch: count-min sketch is a common sketch algorithm that
>> tracks frequency counts. It has the following nice properties:
>> - sublinear space
>> - can be generated in one pass in a streaming fashion
>> - can be incrementally maintained (i.e. for appending new data)
>> - it's already implemented in Spark
>> - more accurate for frequent values and less accurate for infrequent
>> values, i.e. it tracks skewed values well
>> - easy to compute the inner product, i.e. it is trivial to compute
>> the count-min sketch of a join given the count-min sketches of the
>> two join tables
>>
>> Proposal 1 is to use a combination of a count-min sketch and an
>> equi-height histogram. In this case, the count-min sketch is used for
>> selectivity estimation on equality predicates, and the histogram is
>> used for range predicates.
>>
>> Proposal 2 is to use only a count-min sketch for equality predicates,
>> and a simple selected_range / (max - min) estimate for range
>> predicates. This is less accurate than a histogram, but simpler
>> because we don't need to collect histograms.
>>
>> Proposal 3 is a variant of proposal 2 that takes into account that
>> skewed values can affect selectivity heavily. In 3, we track the list
>> of heavy hitters (HH, the most frequent items) alongside the
>> count-min sketch on the column. Then:
>> - use the count-min sketch for equality predicates
>> - for range predicates, estimatedFreq = sum(freq(HHInRange)) +
>> range / (max - min)
>>
>> Proposal 4 is to not use any sketch: use a histogram for
>> high-cardinality columns, and exact (value, frequency) pairs for
>> low-cardinality columns (e.g. number of distinct values <= 255).
>>
>> Proposal 5 is a variant of proposal 4 that tracks exact
>> (value, frequency) pairs only for the most frequent values, so we can
>> still have them for high-cardinality columns. This is very similar to
>> a count-min sketch, but may use less space. It does, however, require
>> two passes to compute the initial values, and it is harder to compute
>> the inner product for joins.
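For anyone wondering where the 48k figure above comes from: assuming the
sizing used by stream-lib-style implementations (which Spark's
CountMinSketch appears to follow), width = ceil(2 / eps) and
depth = ceil(-ln(1 - confidence) / ln 2). With eps = 0.001 and
confidence = 0.87 that gives width = 2000 and depth = 3, so the table of
depth x width 8-byte counters takes 3 * 2000 * 8 = 48,000 bytes.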
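To make the equality-predicate case concrete, here is a minimal Scala
sketch against the CountMinSketch API linked above (the eps, confidence,
seed, and sample values are illustrative only):

    import org.apache.spark.util.sketch.CountMinSketch

    // Sketch with the parameters quoted above: eps = 0.1%, confidence = 0.87.
    val cms = CountMinSketch.create(0.001, 0.87, /* seed = */ 42)

    // One streaming pass over the column's values. New partitions can later
    // be sketched separately and folded in with mergeInPlace, which is what
    // makes incremental maintenance (appends) cheap.
    Seq("a", "a", "b", "c", "a").foreach(v => cms.add(v))

    // Estimated selectivity of the equality predicate `col = 'a'`.
    val selectivity = cms.estimateCount("a").toDouble / cms.totalCount()

In practice the per-column sketch would presumably be built distributed,
e.g. via df.stat.countMinSketch("col", 0.001, 0.87, 42), rather than with
a local loop.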
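And one possible reading of proposal 3's range formula, taking
range / (max - min) as the uniform fraction applied to the rows not
covered by the tracked heavy hitters (all names below are illustrative,
not existing Spark API):

    // Illustrative only: estimated frequency of `lo <= col <= hi`.
    def estimateRangeFreq(
        heavyHitters: Map[Long, Long],  // tracked value -> exact frequency
        lo: Long, hi: Long,             // range predicate bounds (inclusive)
        min: Long, max: Long,           // column min/max statistics
        rowCount: Long): Double = {
      // Exact contribution from heavy hitters that fall inside the range.
      val hhInRange =
        heavyHitters.collect { case (v, f) if v >= lo && v <= hi => f }.sum
      // Remaining rows are assumed uniformly distributed over [min, max].
      val rest = rowCount - heavyHitters.values.sum
      hhInRange + rest * (hi - lo).toDouble / (max - min)
    }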