Yan created SPARK-17375:
---------------------------

             Summary: Star Join Optimization
                 Key: SPARK-17375
                 URL: https://issues.apache.org/jira/browse/SPARK-17375
             Project: Spark
          Issue Type: Improvement
          Components: SQL
            Reporter: Yan


The star schema is the simplest style of data mart schema and is the approach 
often seen in BI/Decision Support systems. Star Join is a popular SQL query 
pattern that joins one or (a few) fact tables with a few dimension tables in 
star schemas. Star Join Query Optimizations aim to optimize the performance and 
use of resource for the star joins.

Currently the existing Spark SQL optimization works on broadcasting the usually 
small (after filtering and projection) dimension tables to avoid costly 
shuffling of fact table and the "reduce" operations based on the join keys.

This improvement proposal tries to further improve the broadcast star joins in 
the two areas:
1) avoid materialization of the intermediate rows that otherwise could 
eventually not make to the final result row set after further joined with other 
dimensions that are more restricting;
2) avoid the performance variations among different join orders. This could 
also have been largely achieved by cost analysis and heuristics and selecting a 
reasonably optimal join order. But we are here trying to achieve similar 
improvement without relying on such info.

A preliminary test against a small TPCDS 1GB data set indicates between 5%-40% 
improvement (with codegen disabled on both tests) vs. the multiple broadcast 
joins on one Query (Q27) that inner joins 4 dimension table with one fact 
table. The large variation (5%-40%) is due to  the different join ordering of 
the 4 broadcast joins. Tests using larger data sets and other TPCDS queries are 
yet to be performed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to