I am not sure I understand when the statistics would be calculated. Would they always be calculated, or only when ANALYZE is called explicitly? Would it be possible to save the analysis results as part of saving a dataframe (e.g. when writing it to Parquet), or do we need a Hive installation for that? Would it be possible to provide the hints manually? For example, in streaming, the data at the beginning may not be representative of the entire stream.
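For concreteness, a minimal sketch of the explicit collection path being asked about, assuming statistics are gathered only when ANALYZE TABLE is run against a catalog table. The table name is made up, whether Hive support is required is exactly the open question above, and the FOR COLUMNS form is the proposed syntax from the design discussion, so it may change:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

    // Persist the dataframe as a catalog table first; statistics live in the
    // table's catalog entry, not inside the Parquet files themselves.
    spark.range(0, 1000000L).toDF("id").write.format("parquet").saveAsTable("events")

    // Table-level statistics (size in bytes, row count) are collected only
    // when this command is run explicitly.
    spark.sql("ANALYZE TABLE events COMPUTE STATISTICS")

    // Column-level statistics are what this thread proposes to add; this
    // FOR COLUMNS form follows the design discussion and may change.
    spark.sql("ANALYZE TABLE events COMPUTE STATISTICS FOR COLUMNS id")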
From: rxin [via Apache Spark Developers List]
Sent: Tuesday, November 15, 2016 8:48 AM
To: Mendelson, Assaf
Subject: Re: statistics collection and propagation for cost-based optimizer

They are not yet complete. The benchmark was done with an implementation of a cost-based optimizer that Huawei had internally for Spark 1.5 (or some even older version).

On Mon, Nov 14, 2016 at 10:46 PM, Yogesh Mahajan <[hidden email]> wrote:

It looks like the Huawei team has run the TPC-H benchmark and some real-world test cases, and their results show a good performance gain of 2x-5x speedup depending on data volume. Can we share the numbers and the query-wise rationale behind the gain? Has anything been done on Spark master yet, or is the implementation not yet complete?

Thanks,
Yogesh Mahajan
http://www.snappydata.io/blog

On Tue, Nov 15, 2016 at 12:03 PM, Yogesh Mahajan <[hidden email]> wrote:

Thanks Reynold for the detailed proposals. A few questions/clarifications:

1. How will the existing rule-based optimizer co-exist with CBO? The existing rules are heuristic/empirical; I am assuming rules like predicate pushdown or project pruning will co-exist with CBO, and we just want to estimate the filter factor and cardinality more accurately. With predicate pushdown, a filter is mostly executed at an early stage of a query plan, and the cardinality estimate of a predicate can improve the precision of the downstream cardinality estimates.
2. Will query transformations now be based on the cost calculation? If yes, what happens when the cost of executing the transformed statement is higher than the cost of the untransformed query?
3. Is there any upper limit on the space used for storing the frequency histogram, e.g. 255 distinct values? For more distinct values, we could even consider a height-balanced histogram, as in Oracle.
4. The first three proposals are new and not mentioned in the CBO design spec. CMS is good, but it is less accurate compared to traditional histograms. This is a major trade-off we need to consider.
5. Are we going to consider system statistics, such as CPU speed or disk access, in the cost function? How about shuffle cost, output partitioning, etc.?
6. Like the current rule-based optimizer, will this CBO also be an 'extensible optimizer'? If yes, what functionality can users extend?
7. Why will this CBO be disabled by default? Is "spark.sql.cbo" false by default just because it is experimental? (A configuration sketch follows this quoted thread.)
8. ANALYZE TABLE, analyzeColumns, etc. all look good.
9. From a release point of view, how is this planned? Will all of this be implemented in one go, or in phases?

Thanks,
Yogesh Mahajan
http://www.snappydata.io/blog

On Mon, Nov 14, 2016 at 11:25 PM, Reynold Xin <[hidden email]> wrote:

Historically TPC-DS and TPC-H. There is certainly a chance of overfitting one or two benchmarks. Note that those will probably be impacted more by the way we set the parameters for CBO than by using x or y for summary statistics.

On Monday, November 14, 2016, Shivaram Venkataraman <[hidden email]> wrote:

Do we have any query workloads on which we can benchmark these proposals in terms of performance?

Thanks,
Shivaram
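To make questions 6 and 7 above concrete, here is a minimal sketch of the extension and configuration surface as it stands today. The no-op rule is hypothetical, and the conf key is just the name quoted in the thread, so both are illustrative rather than definitive:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule

    // Hypothetical no-op rule, only to show where a user-supplied
    // optimization plugs in next to the built-in rule set.
    object MyNoOpRule extends Rule[LogicalPlan] {
      override def apply(plan: LogicalPlan): LogicalPlan = plan
    }

    val spark = SparkSession.builder().getOrCreate()

    // The extension point that exists today for user rules; whether CBO
    // exposes anything beyond this is what question 6 asks.
    spark.experimental.extraOptimizations ++= Seq(MyNoOpRule)

    // "spark.sql.cbo" is the flag name quoted in the thread, proposed to
    // default to false; the conf that eventually ships may be named
    // differently.
    spark.conf.set("spark.sql.cbo", "true")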
On Sun, Nov 13, 2016 at 5:53 PM, Reynold Xin <[hidden email]> wrote:

One additional note: in terms of size, the size of a count-min sketch with eps = 0.1% and confidence 0.87, uncompressed, is 48k bytes.

To look up what that means, see
http://spark.apache.org/docs/latest/api/java/org/apache/spark/util/sketch/CountMinSketch.html

On Sun, Nov 13, 2016 at 5:30 PM, Reynold Xin <[hidden email]> wrote:

I want to bring this discussion to the dev list to gather broader feedback, as there have been some discussions spread over multiple JIRA tickets (SPARK-16026, etc.) and GitHub pull requests about what statistics to collect and how to use them.

There are some basic statistics on columns that are obvious to use and we don't need to debate: estimated size (in bytes), row count, min, max, number of nulls, number of distinct values, average column length, max column length.

In addition, we want to be able to estimate selectivity for equality and range predicates better, especially taking into account skewed values and outliers.

Before I dive into the different options, let me first explain count-min sketch. Count-min sketch is a common sketch algorithm that tracks frequency counts. It has the following nice properties:
- sublinear space
- can be generated in one pass in a streaming fashion
- can be incrementally maintained (i.e. for appending new data)
- it is already implemented in Spark
- more accurate for frequent values and less accurate for less-frequent values, i.e. it tracks skewed values well
- easy to compute the inner product, i.e. it is trivial to compute the count-min sketch of a join given the count-min sketches of the two join tables

Proposal 1 is to use a combination of count-min sketch and equi-height histograms. In this case, the count-min sketch is used for selectivity estimation on equality predicates, and the histogram is used on range predicates.

Proposal 2 is to use only a count-min sketch for equality predicates, and a simple selected_range / (max - min) estimate for range predicates. This is less accurate than using a histogram, but simpler because we don't need to collect histograms.

Proposal 3 is a variant of proposal 2 that takes into account that skewed values can impact selectivity heavily. In 3, we track the list of heavy hitters (HH, the most frequent items) along with a count-min sketch on the column. Then:
- use the count-min sketch on equality predicates
- for range predicates, estimatedFreq = sum(freq(HHInRange)) + range / (max - min)

Proposal 4 is to not use any sketch: use a histogram for high-cardinality columns and exact (value, frequency) pairs for low-cardinality columns (e.g. number of distinct values <= 255).

Proposal 5 is a variant of proposal 4 that tracks exact (value, frequency) pairs for the most frequent values only, so we can still have them for high-cardinality columns. This is actually very similar to count-min sketch, but might use less space, although it requires two passes to compute the initial value and makes the inner product for joins harder to compute.
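For reference, a small sketch of how the count-min sketch described above can be built and queried through the existing org.apache.spark.util.sketch.CountMinSketch support in DataFrameStatFunctions; the toy column and the parameter choices are illustrative only:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.util.sketch.CountMinSketch

    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    // Toy column with one heavily skewed value, to mimic the skew case the
    // proposals care about.
    val df = (Seq.fill(10000)(42) ++ (1 to 1000)).toDF("x")

    // Build a count-min sketch over the column; eps and confidence trade
    // space for accuracy (the 48k-byte figure above corresponds to
    // eps = 0.1%, confidence 0.87).
    val cms: CountMinSketch = df.stat.countMinSketch("x", 0.001, 0.87, 42)

    // Selectivity of an equality predicate such as WHERE x = 42.
    val estFreq = cms.estimateCount(42)
    val selectivity = estFreq.toDouble / cms.totalCount()

    // Incremental maintenance: a sketch built over newly appended data can
    // be merged in without rescanning old data (same eps/confidence/seed).
    val appended = (1 to 100).toDF("x")
    cms.mergeInPlace(appended.stat.countMinSketch("x", 0.001, 0.87, 42))

Range predicates are where the proposals diverge: proposal 1 would consult an equi-height histogram instead, while proposal 3 would combine exact counts for heavy hitters inside the range with the uniform range / (max - min) estimate.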