[ https://issues.apache.org/jira/browse/SPARK-16026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15472000#comment-15472000 ]
Ron Hu commented on SPARK-16026: -------------------------------- Hi Srinath, Thank you for your comments. Let me answer them one by one. First, we should consider the data shuffle cost. Yes, this is part of phase 2 cost functions in our plan. As we already implemented the phase 1 cost function, we want to contribute our existing development work to Spark community ASAP. We will expand to phase 2 CBO work soon. In phase 2, we will develop cost function for each execution operator. The EXCHANGE operator is one we need to define its cost function. Your suggestion is quite reasonable. Second, we define two statements: (1) ANALYZE TABLE table_name COMPUTE STATISTICS; (2) ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS column-name1, column-name2, …. As you know, the ANALYZE TABLE command collects the auxiliary statistics information. A good DBA needs to monitor the status of the statistics information. I mean there always exists an issue whether or not the statistics data is stale. Hence, we do not want to use the transaction criteria to view statistics data. On the other hand, we may do a little better to make them consistent. One way is to refresh table level statistics when we execute the command "ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS column-name1, column-name2, …. " to collect column level statistics. Third, we do not have default selectivity assumed. In the design spec, we defined how to estimate the cardinality for logical AND operator in section 6. In the future, we may use either 2-dimensional histogram and/or SQL hint to handle the correlation among multiple correlated columns. > Cost-based Optimizer framework > ------------------------------ > > Key: SPARK-16026 > URL: https://issues.apache.org/jira/browse/SPARK-16026 > Project: Spark > Issue Type: New Feature > Components: SQL > Reporter: Reynold Xin > Attachments: Spark_CBO_Design_Spec.pdf > > > This is an umbrella ticket to implement a cost-based optimizer framework > beyond broadcast join selection. This framework can be used to implement some > useful optimizations such as join reordering. > The design should discuss how to break the work down into multiple, smaller > logical units. For example, changes to statistics class, system catalog, cost > estimation/propagation in expressions, cost estimation/propagation in > operators can be done in decoupled pull requests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org