I am not sure I understand when the statistics would be calculated. Would they always be calculated, or only when ANALYZE is called explicitly? Would it be possible to save the analysis results as part of saving a dataframe (e.g. when writing it to Parquet), or do we need a Hive installation for that? Would it be possible to provide the hints manually? For example, in streaming, the data at the beginning may not be representative of the entire stream.
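For concreteness, a minimal sketch of the explicit collection path being asked about, assuming statistics are gathered only when ANALYZE TABLE is run against a catalog table. The table name is made up, whether Hive support is required is exactly the open question above, and the FOR COLUMNS form is the proposed syntax from the design discussion, so it may change:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

    // Persist the dataframe as a catalog table first; statistics live in the
    // table's catalog entry, not inside the Parquet files themselves.
    spark.range(0, 1000000L).toDF("id").write.format("parquet").saveAsTable("events")

    // Table-level statistics (size in bytes, row count) are collected only
    // when this command is run explicitly.
    spark.sql("ANALYZE TABLE events COMPUTE STATISTICS")

    // Column-level statistics are what this thread proposes to add; this
    // FOR COLUMNS form follows the design discussion and may change.
    spark.sql("ANALYZE TABLE events COMPUTE STATISTICS FOR COLUMNS id")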
From: rxin [via Apache Spark Developers List]
Sent: Tuesday, November 15, 2016 8:48 AM
To: Mendelson, Assaf
Subject: Re: statistics collection and propagation for cost-based optimizer

They are not yet complete. The benchmark was done with an implementation of a cost-based optimizer that Huawei had internally for Spark 1.5 (or some even older version).

On Mon, Nov 14, 2016 at 10:46 PM, Yogesh Mahajan <[hidden email]> wrote:

It looks like the Huawei team has run the TPC-H benchmark and some real-world test cases, and their results show a good performance gain of 2x-5x speedup depending on data volume. Can we share the numbers and the query-wise rationale behind the gain? Has anything been done on Spark master yet, or is the implementation not yet complete?

Thanks,
Yogesh Mahajan
http://www.snappydata.io/blog

On Tue, Nov 15, 2016 at 12:03 PM, Yogesh Mahajan <[hidden email]> wrote:

Thanks Reynold for the detailed proposals. A few questions/clarifications:

1. How will the existing rule-based optimizer co-exist with CBO? The existing rules are heuristic/empirical; I am assuming rules like predicate pushdown or project pruning will co-exist with CBO, and we just want to estimate the filter factor and cardinality more accurately. With predicate pushdown, a filter is mostly executed at an early stage of a query plan, and the cardinality estimate of a predicate can improve the precision of the downstream cardinality estimates.
2. Will query transformations now be based on the cost calculation? If yes, what happens when the cost of executing the transformed statement is higher than the cost of the untransformed query?
3. Is there any upper limit on the space used for storing the frequency histogram, e.g. 255 distinct values? For more distinct values, we could even consider a height-balanced histogram, as in Oracle.
4. The first three proposals are new and not mentioned in the CBO design spec. CMS is good, but it is less accurate compared to traditional histograms. This is a major trade-off we need to consider.
5. Are we going to consider system statistics, such as CPU speed or disk access, in the cost function? How about shuffle cost, output partitioning, etc.?
6. Like the current rule-based optimizer, will this CBO also be an 'extensible optimizer'? If yes, what functionality can users extend?
7. Why will this CBO be disabled by default? Is "spark.sql.cbo" false by default just because it is experimental? (A configuration sketch follows this quoted thread.)
8. ANALYZE TABLE, analyzeColumns, etc. all look good.
9. From a release point of view, how is this planned? Will all of this be implemented in one go, or in phases?

Thanks,
Yogesh Mahajan
http://www.snappydata.io/blog

On Mon, Nov 14, 2016 at 11:25 PM, Reynold Xin <[hidden email]> wrote:

Historically TPC-DS and TPC-H. There is certainly a chance of overfitting one or two benchmarks. Note that those will probably be impacted more by the way we set the parameters for CBO than by using x or y for summary statistics.

On Monday, November 14, 2016, Shivaram Venkataraman <[hidden email]> wrote:

Do we have any query workloads on which we can benchmark these proposals in terms of performance?

Thanks,
Shivaram
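To make questions 6 and 7 above concrete, here is a minimal sketch of the extension and configuration surface as it stands today. The no-op rule is hypothetical, and the conf key is just the name quoted in the thread, so both are illustrative rather than definitive:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule

    // Hypothetical no-op rule, only to show where a user-supplied
    // optimization plugs in next to the built-in rule set.
    object MyNoOpRule extends Rule[LogicalPlan] {
      override def apply(plan: LogicalPlan): LogicalPlan = plan
    }

    val spark = SparkSession.builder().getOrCreate()

    // The extension point that exists today for user rules; whether CBO
    // exposes anything beyond this is what question 6 asks.
    spark.experimental.extraOptimizations ++= Seq(MyNoOpRule)

    // "spark.sql.cbo" is the flag name quoted in the thread, proposed to
    // default to false; the conf that eventually ships may be named
    // differently.
    spark.conf.set("spark.sql.cbo", "true")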
On Sun, Nov 13, 2016 at 5:53 PM, Reynold Xin <[hidden email]> wrote:

One additional note: in terms of size, the size of a count-min sketch with eps = 0.1% and confidence 0.87, uncompressed, is 48k bytes.

To look up what that means, see
http://spark.apache.org/docs/latest/api/java/org/apache/spark/util/sketch/CountMinSketch.html

On Sun, Nov 13, 2016 at 5:30 PM, Reynold Xin <[hidden email]> wrote:

I want to bring this discussion to the dev list to gather broader feedback, as there have been some discussions spread over multiple JIRA tickets (SPARK-16026, etc.) and GitHub pull requests about what statistics to collect and how to use them.

There are some basic statistics on columns that are obvious to use and we don't need to debate: estimated size (in bytes), row count, min, max, number of nulls, number of distinct values, average column length, max column length.

In addition, we want to be able to estimate selectivity for equality and range predicates better, especially taking into account skewed values and outliers.

Before I dive into the different options, let me first explain count-min sketch. Count-min sketch is a common sketch algorithm that tracks frequency counts. It has the following nice properties:
- sublinear space
- can be generated in one pass in a streaming fashion
- can be incrementally maintained (i.e. for appending new data)
- it is already implemented in Spark
- more accurate for frequent values and less accurate for less-frequent values, i.e. it tracks skewed values well
- easy to compute the inner product, i.e. it is trivial to compute the count-min sketch of a join given the count-min sketches of the two join tables

Proposal 1 is to use a combination of count-min sketch and equi-height histograms. In this case, the count-min sketch is used for selectivity estimation on equality predicates, and the histogram is used on range predicates.

Proposal 2 is to use only a count-min sketch for equality predicates, and a simple selected_range / (max - min) estimate for range predicates. This is less accurate than using a histogram, but simpler because we don't need to collect histograms.

Proposal 3 is a variant of proposal 2 that takes into account that skewed values can impact selectivity heavily. In 3, we track the list of heavy hitters (HH, the most frequent items) along with a count-min sketch on the column. Then:
- use the count-min sketch on equality predicates
- for range predicates, estimatedFreq = sum(freq(HHInRange)) + range / (max - min)

Proposal 4 is to not use any sketch: use a histogram for high-cardinality columns and exact (value, frequency) pairs for low-cardinality columns (e.g. number of distinct values <= 255).

Proposal 5 is a variant of proposal 4 that tracks exact (value, frequency) pairs for the most frequent values only, so we can still have them for high-cardinality columns. This is actually very similar to count-min sketch, but might use less space, although it requires two passes to compute the initial value and makes the inner product for joins harder to compute.
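For reference, a small sketch of how the count-min sketch described above can be built and queried through the existing org.apache.spark.util.sketch.CountMinSketch support in DataFrameStatFunctions; the toy column and the parameter choices are illustrative only:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.util.sketch.CountMinSketch

    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    // Toy column with one heavily skewed value, to mimic the skew case the
    // proposals care about.
    val df = (Seq.fill(10000)(42) ++ (1 to 1000)).toDF("x")

    // Build a count-min sketch over the column; eps and confidence trade
    // space for accuracy (the 48k-byte figure above corresponds to
    // eps = 0.1%, confidence 0.87).
    val cms: CountMinSketch = df.stat.countMinSketch("x", 0.001, 0.87, 42)

    // Selectivity of an equality predicate such as WHERE x = 42.
    val estFreq = cms.estimateCount(42)
    val selectivity = estFreq.toDouble / cms.totalCount()

    // Incremental maintenance: a sketch built over newly appended data can
    // be merged in without rescanning old data (same eps/confidence/seed).
    val appended = (1 to 100).toDF("x")
    cms.mergeInPlace(appended.stat.countMinSketch("x", 0.001, 0.87, 42))

Range predicates are where the proposals diverge: proposal 1 would consult an equi-height histogram instead, while proposal 3 would combine exact counts for heavy hitters inside the range with the uniform range / (max - min) estimate.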