Ioana Delaney created SPARK-17626:
-------------------------------------

             Summary: TPC-DS performance improvements using star-schema 
heuristics
                 Key: SPARK-17626
                 URL: https://issues.apache.org/jira/browse/SPARK-17626
             Project: Spark
          Issue Type: Umbrella
          Components: SQL
    Affects Versions: 2.1.0
            Reporter: Ioana Delaney
            Priority: Critical


*TPC-DS performance improvements using star-schema heuristics*
\\
\\
TPC-DS consists of multiple snowflake schema, which are multiple star schema 
with dimensions linking to dimensions. A star schema consists of a fact table 
referencing a number of dimension tables. Fact table holds the main data about 
a business. Dimension table, a usually smaller table, describes data reflecting 
the dimension/attribute of a business.
\\
\\
As part of the benchmark performance investigation, we observed a pattern of 
sub-optimal execution plans of large fact tables joins. Manual rewrite of some 
of the queries into selective fact-dimensions joins resulted in significant 
performance improvement. This prompted us to develop a simple join reordering 
algorithm based on star schema detection. The performance testing using *1TB 
TPC-DS workload* shows an overall improvement of *19%*. 
\\
\\
*Summary of the results:*
{code}
Passed                 99
Failed                  0
Total q time (s)   14,962
Max time            1,467
Min time                3
Mean time             145
Geomean                44
{code}

*Compared to baseline* (Negative = improvement; Positive = Degradation):
{code}
End to end improved (%)              -19%       
Mean time improved (%)               -19%
Geomean improved (%)                 -24%
End to end improved (seconds)      -3,603
Number of queries improved (>10%)      45
Number of queries degraded (>10%)       6
Number of queries unchanged            48
Top 10 queries improved (%)          -20%
{code}

Cluster: 20-node cluster with each node having:
* 10 2TB hard disks in a JBOD configuration, 2 Intel(R) Xeon(R) CPU E5-2680 v2 
@ 2.80GHz processors, 128 GB RAM, 10Gigabit Ethernet.
* Total memory for the cluster: 2.5TB
* Total storage: 400TB
* Total CPU cores: 480

Hadoop stack: IBM Open Platform with Apache Hadoop v4.2. Apache Spark 2.0 GA

Database info:
* Schema: TPCDS 
* Scale factor: 1TB total space
* Storage format: Parquet with Snappy compression

Our investigation and results are included in the attached document.

There are two parts to this improvement:
# Join reordering using star schema detection
# New selectivity hint to specify the selectivity of the predicates over base 
tables.
\\
\\





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to