Ioana Delaney created SPARK-17626:

             Summary: TPC-DS performance improvements using star-schema 
                 Key: SPARK-17626
             Project: Spark
          Issue Type: Umbrella
          Components: SQL
    Affects Versions: 2.1.0
            Reporter: Ioana Delaney
            Priority: Critical

*TPC-DS performance improvements using star-schema heuristics*
TPC-DS consists of multiple snowflake schema, which are multiple star schema 
with dimensions linking to dimensions. A star schema consists of a fact table 
referencing a number of dimension tables. Fact table holds the main data about 
a business. Dimension table, a usually smaller table, describes data reflecting 
the dimension/attribute of a business.
As part of the benchmark performance investigation, we observed a pattern of 
sub-optimal execution plans of large fact tables joins. Manual rewrite of some 
of the queries into selective fact-dimensions joins resulted in significant 
performance improvement. This prompted us to develop a simple join reordering 
algorithm based on star schema detection. The performance testing using *1TB 
TPC-DS workload* shows an overall improvement of *19%*. 
*Summary of the results:*
Passed                 99
Failed                  0
Total q time (s)   14,962
Max time            1,467
Min time                3
Mean time             145
Geomean                44

*Compared to baseline* (Negative = improvement; Positive = Degradation):
End to end improved (%)              -19%       
Mean time improved (%)               -19%
Geomean improved (%)                 -24%
End to end improved (seconds)      -3,603
Number of queries improved (>10%)      45
Number of queries degraded (>10%)       6
Number of queries unchanged            48
Top 10 queries improved (%)          -20%

Cluster: 20-node cluster with each node having:
* 10 2TB hard disks in a JBOD configuration, 2 Intel(R) Xeon(R) CPU E5-2680 v2 
@ 2.80GHz processors, 128 GB RAM, 10Gigabit Ethernet.
* Total memory for the cluster: 2.5TB
* Total storage: 400TB
* Total CPU cores: 480

Hadoop stack: IBM Open Platform with Apache Hadoop v4.2. Apache Spark 2.0 GA

Database info:
* Schema: TPCDS 
* Scale factor: 1TB total space
* Storage format: Parquet with Snappy compression

Our investigation and results are included in the attached document.

There are two parts to this improvement:
# Join reordering using star schema detection
# New selectivity hint to specify the selectivity of the predicates over base 

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

Reply via email to