Yan created SPARK-17375:
---------------------------
Summary: Star Join Optimization
Key: SPARK-17375
URL: https://issues.apache.org/jira/browse/SPARK-17375
Project: Spark
Issue Type: Improvement
Components: SQL
Reporter: Yan
The star schema is the simplest style of data mart schema and is the approach
often seen in BI/Decision Support systems. Star Join is a popular SQL query
pattern that joins one or a few fact tables with a few dimension tables in
star schemas. Star Join Query Optimizations aim to improve the performance and
resource usage of star joins.
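
For readers unfamiliar with the pattern, here is a minimal sketch of a star
join, run through Python's built-in sqlite3 module rather than Spark. The
table and column names (fact_sales, dim_store, dim_item) are made up for
illustration and do not come from TPCDS or from this proposal.

```python
import sqlite3

# In-memory database holding a tiny, hypothetical star schema:
# one fact table that references two dimension tables by key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_store  (store_id INTEGER PRIMARY KEY, state TEXT);
    CREATE TABLE dim_item   (item_id  INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales (store_id INTEGER, item_id INTEGER, amount REAL);
    INSERT INTO dim_store  VALUES (1, 'CA'), (2, 'NY');
    INSERT INTO dim_item   VALUES (10, 'Books'), (20, 'Music');
    INSERT INTO fact_sales VALUES (1, 10, 5.0), (1, 20, 7.0),
                                  (2, 10, 3.0), (2, 20, 9.0);
""")

# The star join: the fact table is joined to each dimension on its key,
# and the filters are expressed on the dimension attributes.
star_join_rows = conn.execute("""
    SELECT s.state, i.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_store s ON f.store_id = s.store_id
    JOIN dim_item  i ON f.item_id  = i.item_id
    WHERE s.state = 'CA' AND i.category = 'Books'
    GROUP BY s.state, i.category
""").fetchall()

print(star_join_rows)  # [('CA', 'Books', 5.0)]
```

Only one of the four fact rows survives both dimension filters, which is the
kind of selectivity the optimizations described below try to exploit.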
The existing Spark SQL optimization broadcasts the dimension tables, which are
usually small after filtering and projection, to avoid the costly shuffling of
the fact table and the "reduce" operations based on the join keys.
This improvement proposal tries to further improve broadcast star joins in
two areas:
1) avoid materializing intermediate rows that would not make it into the
final result row set once they are joined with other, more restrictive
dimensions;
2) avoid the performance variations among different join orders. This could
also be largely achieved by using cost analysis and heuristics to select a
reasonably optimal join order, but here we are trying to achieve a similar
improvement without relying on such information.
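
The two areas above can be illustrated with a small, Spark-free sketch in
plain Python. It is not the implementation proposed in this issue, which does
not spell out the mechanism here. It assumes one plausible approach: probe
all broadcast dimension hash tables for each fact row and build an output row
only when every probe hits. The data, names (fact, dims) and functions are
hypothetical.

```python
from itertools import permutations

# Hypothetical toy fact table: (date_id, store_id, item_id, amount), 64 rows.
fact = [(d, s, i, 1.0) for d in range(4) for s in range(4) for i in range(4)]

# Hypothetical filtered dimension tables held as in-memory hash maps (the
# "broadcast" side): name -> (fact column index, {join key: attribute}).
dims = {
    "date": (0, {0: "2000", 1: "2001", 2: "2002"}),  # keeps 3 of 4 keys
    "store": (1, {0: "CA", 1: "NY"}),                # keeps 2 of 4 keys
    "item": (2, {3: "Books"}),                       # keeps 1 of 4 keys
}

def normalize(rows):
    """Order-independent form of a join result, used to compare plans."""
    return sorted((r, tuple(sorted(attrs.items()))) for r, attrs in rows)

def cascaded_joins(order):
    """One hash join per dimension, applied in the given order.
    Returns the result and the number of rows built by the non-final joins."""
    rows = [(r, {}) for r in fact]
    intermediate = 0
    for n, name in enumerate(order):
        col, table = dims[name]
        rows = [(r, {**attrs, name: table[r[col]]})
                for r, attrs in rows if r[col] in table]
        if n < len(order) - 1:
            intermediate += len(rows)
    return normalize(rows), intermediate

def single_pass_star_join():
    """Probe every dimension per fact row; build a row only if all probes hit."""
    rows = []
    for r in fact:
        if all(r[col] in table for col, table in dims.values()):
            attrs = {name: table[r[col]] for name, (col, table) in dims.items()}
            rows.append((r, attrs))
    return normalize(rows)

star_result = single_pass_star_join()
costs = {}
for order in permutations(dims):
    result, intermediate = cascaded_joins(order)
    assert result == star_result  # every join order yields the same rows
    costs[order] = intermediate

# Final rows, then fewest and most intermediate rows across the 6 join orders.
print(len(star_result), min(costs.values()), max(costs.values()))  # 6 24 72
```

On this toy data the cascaded plan builds between 24 and 72 intermediate rows
depending on the join order, while the single-pass variant builds only the 6
final rows regardless of order. Real gains would depend on data sizes,
selectivities and codegen, none of which this sketch models.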
A preliminary test against a small 1GB TPCDS data set indicates a 5%-40%
improvement (with codegen disabled in both tests) over multiple broadcast
joins on one query (Q27), which inner joins 4 dimension tables with one fact
table. The large variation (5%-40%) is due to the different join orderings of
the 4 broadcast joins. Tests using larger data sets and other TPCDS queries are
yet to be performed.