[ 
https://issues.apache.org/jira/browse/SPARK-17626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ioana Delaney updated SPARK-17626:
----------------------------------
    Description: 
*TPC-DS performance improvements using star-schema heuristics*
\\
\\
TPC-DS consists of multiple snowflake schema, which are multiple star schema 
with dimensions linking to dimensions. A star schema consists of a fact table 
referencing a number of dimension tables. Fact table holds the main data about 
a business. Dimension table, a usually smaller table, describes data reflecting 
the dimension/attribute of a business.
\\
\\
As part of the benchmark performance investigation, we observed a pattern of 
sub-optimal execution plans of large fact tables joins. Manual rewrite of some 
of the queries into selective fact-dimensions joins resulted in significant 
performance improvement. This prompted us to develop a simple join reordering 
algorithm based on star schema detection. The performance testing using *1TB 
TPC-DS workload* shows an overall improvement of *19%*. 
\\
\\
*Summary of the results:*
{code}
Passed                 99
Failed                  0
Total q time (s)   14,962
Max time            1,467
Min time                3
Mean time             145
Geomean                44
{code}

*Compared to baseline* (Negative = improvement; Positive = Degradation):
{code}
End to end improved (%)              -19%       
Mean time improved (%)               -19%
Geomean improved (%)                 -24%
End to end improved (seconds)      -3,603
Number of queries improved (>10%)      45
Number of queries degraded (>10%)       6
Number of queries unchanged            48
Top 10 queries improved (%)          -20%
{code}

Cluster: 20-node cluster with each node having:
* 10 2TB hard disks in a JBOD configuration, 2 Intel(R) Xeon(R) CPU E5-2680 v2 
@ 2.80GHz processors, 128 GB RAM, 10Gigabit Ethernet.
* Total memory for the cluster: 2.5TB
* Total storage: 400TB
* Total CPU cores: 480

Hadoop stack: IBM Open Platform with Apache Hadoop v4.2. Apache Spark 2.0 GA

Database info:
* Schema: TPCDS 
* Scale factor: 1TB total space
* Storage format: Parquet with Snappy compression

Our investigation and results are included in the attached document.

There are two parts to this improvement:
# Join reordering using star schema detection
# New selectivity hint to specify the selectivity of the predicates over base 
tables. Selectivity hint is optional and it was not used in the above TPC-DS 
tests. 
\\



  was:
*TPC-DS performance improvements using star-schema heuristics*
\\
\\
TPC-DS consists of multiple snowflake schema, which are multiple star schema 
with dimensions linking to dimensions. A star schema consists of a fact table 
referencing a number of dimension tables. Fact table holds the main data about 
a business. Dimension table, a usually smaller table, describes data reflecting 
the dimension/attribute of a business.
\\
\\
As part of the benchmark performance investigation, we observed a pattern of 
sub-optimal execution plans of large fact tables joins. Manual rewrite of some 
of the queries into selective fact-dimensions joins resulted in significant 
performance improvement. This prompted us to develop a simple join reordering 
algorithm based on star schema detection. The performance testing using *1TB 
TPC-DS workload* shows an overall improvement of *19%*. 
\\
\\
*Summary of the results:*
{code}
Passed                 99
Failed                  0
Total q time (s)   14,962
Max time            1,467
Min time                3
Mean time             145
Geomean                44
{code}

*Compared to baseline* (Negative = improvement; Positive = Degradation):
{code}
End to end improved (%)              -19%       
Mean time improved (%)               -19%
Geomean improved (%)                 -24%
End to end improved (seconds)      -3,603
Number of queries improved (>10%)      45
Number of queries degraded (>10%)       6
Number of queries unchanged            48
Top 10 queries improved (%)          -20%
{code}

Cluster: 20-node cluster with each node having:
* 10 2TB hard disks in a JBOD configuration, 2 Intel(R) Xeon(R) CPU E5-2680 v2 
@ 2.80GHz processors, 128 GB RAM, 10Gigabit Ethernet.
* Total memory for the cluster: 2.5TB
* Total storage: 400TB
* Total CPU cores: 480

Hadoop stack: IBM Open Platform with Apache Hadoop v4.2. Apache Spark 2.0 GA

Database info:
* Schema: TPCDS 
* Scale factor: 1TB total space
* Storage format: Parquet with Snappy compression

Our investigation and results are included in the attached document.

There are two parts to this improvement:
# Join reordering using star schema detection
# New selectivity hint to specify the selectivity of the predicates over base 
tables.
\\
\\




> TPC-DS performance improvements using star-schema heuristics
> ------------------------------------------------------------
>
>                 Key: SPARK-17626
>                 URL: https://issues.apache.org/jira/browse/SPARK-17626
>             Project: Spark
>          Issue Type: Umbrella
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Ioana Delaney
>            Priority: Critical
>         Attachments: StarSchemaJoinReordering.pptx
>
>
> *TPC-DS performance improvements using star-schema heuristics*
> \\
> \\
> TPC-DS consists of multiple snowflake schema, which are multiple star schema 
> with dimensions linking to dimensions. A star schema consists of a fact table 
> referencing a number of dimension tables. Fact table holds the main data 
> about a business. Dimension table, a usually smaller table, describes data 
> reflecting the dimension/attribute of a business.
> \\
> \\
> As part of the benchmark performance investigation, we observed a pattern of 
> sub-optimal execution plans of large fact tables joins. Manual rewrite of 
> some of the queries into selective fact-dimensions joins resulted in 
> significant performance improvement. This prompted us to develop a simple 
> join reordering algorithm based on star schema detection. The performance 
> testing using *1TB TPC-DS workload* shows an overall improvement of *19%*. 
> \\
> \\
> *Summary of the results:*
> {code}
> Passed                 99
> Failed                  0
> Total q time (s)   14,962
> Max time            1,467
> Min time                3
> Mean time             145
> Geomean                44
> {code}
> *Compared to baseline* (Negative = improvement; Positive = Degradation):
> {code}
> End to end improved (%)              -19%     
> Mean time improved (%)               -19%
> Geomean improved (%)                 -24%
> End to end improved (seconds)      -3,603
> Number of queries improved (>10%)      45
> Number of queries degraded (>10%)       6
> Number of queries unchanged            48
> Top 10 queries improved (%)          -20%
> {code}
> Cluster: 20-node cluster with each node having:
> * 10 2TB hard disks in a JBOD configuration, 2 Intel(R) Xeon(R) CPU E5-2680 
> v2 @ 2.80GHz processors, 128 GB RAM, 10Gigabit Ethernet.
> * Total memory for the cluster: 2.5TB
> * Total storage: 400TB
> * Total CPU cores: 480
> Hadoop stack: IBM Open Platform with Apache Hadoop v4.2. Apache Spark 2.0 GA
> Database info:
> * Schema: TPCDS 
> * Scale factor: 1TB total space
> * Storage format: Parquet with Snappy compression
> Our investigation and results are included in the attached document.
> There are two parts to this improvement:
> # Join reordering using star schema detection
> # New selectivity hint to specify the selectivity of the predicates over base 
> tables. Selectivity hint is optional and it was not used in the above TPC-DS 
> tests. 
> \\



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to