Do we have any query workloads for which we can benchmark these proposals in terms of performance?
Thanks
Shivaram

On Sun, Nov 13, 2016 at 5:53 PM, Reynold Xin <r...@databricks.com> wrote:
> One additional note: in terms of size, a count-min sketch with
> eps = 0.1% and confidence 0.87, uncompressed, is 48k bytes.
>
> To look up what that means, see
> http://spark.apache.org/docs/latest/api/java/org/apache/spark/util/sketch/CountMinSketch.html
>
> On Sun, Nov 13, 2016 at 5:30 PM, Reynold Xin <r...@databricks.com> wrote:
>>
>> I want to bring this discussion to the dev list to gather broader
>> feedback, as there have been discussions spread across multiple JIRA
>> tickets (SPARK-16026, etc.) and GitHub pull requests about what
>> statistics to collect and how to use them.
>>
>> There are some basic column statistics that are obviously useful, and
>> we don't need to debate these: estimated size (in bytes), row count,
>> min, max, number of nulls, number of distinct values, average column
>> length, and max column length.
>>
>> In addition, we want to estimate selectivity for equality and range
>> predicates more accurately, especially in the presence of skewed
>> values and outliers.
>>
>> Before I dive into the different options, let me first explain
>> count-min sketch: count-min sketch is a common sketch algorithm that
>> tracks frequency counts. It has the following nice properties:
>> - sublinear space
>> - can be generated in one pass in a streaming fashion
>> - can be incrementally maintained (i.e. for appending new data)
>> - it's already implemented in Spark
>> - more accurate for frequent values and less accurate for infrequent
>> values, i.e. it tracks skewed values well
>> - easy to compute the inner product, i.e. it is trivial to compute
>> the count-min sketch of a join given the count-min sketches of the
>> two join tables
>>
>> Proposal 1 is to use a combination of a count-min sketch and an
>> equi-height histogram. In this case, the count-min sketch is used for
>> selectivity estimation on equality predicates, and the histogram is
>> used for range predicates.
>>
>> Proposal 2 is to use only a count-min sketch for equality predicates,
>> and a simple selected_range / (max - min) estimate for range
>> predicates. This is less accurate than a histogram, but simpler
>> because we don't need to collect histograms.
>>
>> Proposal 3 is a variant of proposal 2 that takes into account that
>> skewed values can affect selectivity heavily. In 3, we track the list
>> of heavy hitters (HH, the most frequent items) alongside the
>> count-min sketch on the column. Then:
>> - use the count-min sketch for equality predicates
>> - for range predicates, estimatedFreq = sum(freq(HHInRange)) +
>> range / (max - min)
>>
>> Proposal 4 is to not use any sketch: use a histogram for
>> high-cardinality columns, and exact (value, frequency) pairs for
>> low-cardinality columns (e.g. number of distinct values <= 255).
>>
>> Proposal 5 is a variant of proposal 4 that tracks exact
>> (value, frequency) pairs only for the most frequent values, so we can
>> still have them for high-cardinality columns. This is very similar to
>> a count-min sketch, but may use less space. It does, however, require
>> two passes to compute the initial values, and it is harder to compute
>> the inner product for joins.
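For anyone wondering where the 48k figure above comes from: assuming the
sizing used by stream-lib-style implementations (which Spark's
CountMinSketch appears to follow), width = ceil(2 / eps) and
depth = ceil(-ln(1 - confidence) / ln 2). With eps = 0.001 and
confidence = 0.87 that gives width = 2000 and depth = 3, so the table of
depth x width 8-byte counters takes 3 * 2000 * 8 = 48,000 bytes.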
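To make the equality-predicate case concrete, here is a minimal Scala
sketch against the CountMinSketch API linked above (the eps, confidence,
seed, and sample values are illustrative only):

    import org.apache.spark.util.sketch.CountMinSketch

    // Sketch with the parameters quoted above: eps = 0.1%, confidence = 0.87.
    val cms = CountMinSketch.create(0.001, 0.87, /* seed = */ 42)

    // One streaming pass over the column's values. New partitions can later
    // be sketched separately and folded in with mergeInPlace, which is what
    // makes incremental maintenance (appends) cheap.
    Seq("a", "a", "b", "c", "a").foreach(v => cms.add(v))

    // Estimated selectivity of the equality predicate `col = 'a'`.
    val selectivity = cms.estimateCount("a").toDouble / cms.totalCount()

In practice the per-column sketch would presumably be built distributed,
e.g. via df.stat.countMinSketch("col", 0.001, 0.87, 42), rather than with
a local loop.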
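And one possible reading of proposal 3's range formula, taking
range / (max - min) as the uniform fraction applied to the rows not
covered by the tracked heavy hitters (all names below are illustrative,
not existing Spark API):

    // Illustrative only: estimated frequency of `lo <= col <= hi`.
    def estimateRangeFreq(
        heavyHitters: Map[Long, Long],  // tracked value -> exact frequency
        lo: Long, hi: Long,             // range predicate bounds (inclusive)
        min: Long, max: Long,           // column min/max statistics
        rowCount: Long): Double = {
      // Exact contribution from heavy hitters that fall inside the range.
      val hhInRange =
        heavyHitters.collect { case (v, f) if v >= lo && v <= hi => f }.sum
      // Remaining rows are assumed uniformly distributed over [min, max].
      val rest = rowCount - heavyHitters.values.sum
      hhInRange + rest * (hi - lo).toDouble / (max - min)
    }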