GitHub user inouehrs opened a pull request:

    https://github.com/apache/spark/pull/13459

    [SPARK-15726] [SQL] Make DatasetBenchmark fairer among Dataset, DataFrame and RDD

    ## What changes were proposed in this pull request?
    
    DatasetBenchmark compares the performance of RDD, DataFrame and Dataset
    while running the same operations.
    In the backToBackMap test case, however, the DataFrame implementation does
    less work than the RDD and Dataset implementations. This test case processes
    Long+String pairs, but the output of the DataFrame implementation does not
    include the String part, while the RDD and Dataset implementations generate
    Long+String pairs as output. This difference significantly changes the
    performance characteristics because of the overhead of String manipulation
    and creation. Without the fix, DataFrame was more than 2x faster than RDD;
    with the fix, RDD outperforms DataFrame. The performance gap between
    DataFrame and Dataset also becomes much narrower.
    
    Of course, this issue does not affect Spark users, but it may confuse Spark
    developers.
    
    ```scala
    // DataFrame
    val df = spark.range(1, numRows)
      .select($"id".as("l"), $"id".cast(StringType).as("s"))
    var res = df
    res = res.select($"l" + 1 as "l")
    // For a fair comparison, this should be:
    //   res = res.select($"l" + 1 as "l", $"s")
    
    // Dataset (a separate benchmark case, hence the second `res`)
    case class Data(l: Long, s: String)
    val func = (d: Data) => Data(d.l + 1, d.s)
    var res = df.as[Data]
    res = res.map(func)
    ```
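    
    The fair DataFrame variant mentioned in the comment above can be sketched as
    a standalone program. This is only a minimal sketch, assuming Spark 2.x with
    `SparkSession`; the object name, the small `numRows` and the `numChains`
    value are hypothetical and chosen for illustration, not taken from the
    benchmark itself.
    
    ```scala
    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.types.StringType

    // Hypothetical standalone sketch, not the benchmark code from this patch.
    object FairBackToBackMap {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("fair-back-to-back-map-sketch")
          .getOrCreate()
        import spark.implicits._

        val numRows = 1000L  // assumption: small size, for illustration only
        val numChains = 10   // assumption: number of chained maps

        val df = spark.range(1, numRows)
          .select($"id".as("l"), $"id".cast(StringType).as("s"))

        // Chain the same projection, carrying the String column through
        // every step so the output matches the RDD and Dataset variants.
        var res: DataFrame = df
        var i = 0
        while (i < numChains) {
          res = res.select(($"l" + 1).as("l"), $"s")
          i += 1
        }

        // Both columns survive the whole chain.
        assert(res.columns.sameElements(Array("l", "s")))
        spark.stop()
      }
    }
    ```
    
    Dropping `$"s"` from the `select` would leave a single-column result, which
    is the lighter workload the unfixed DataFrame case was measuring.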
    
    Additionally, I added a new test case named "back-to-back map for
    primitive". It is almost equivalent to the old behavior of the DataFrame
    implementation of back-to-back map.
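    
    As a rough illustration of what a primitive-only case looks like, the three
    variants below each map over plain Long values with no String column. This
    is a sketch under assumptions: the object name and the small `numRows` are
    hypothetical, and the code is not copied from the patch.
    
    ```scala
    import org.apache.spark.sql.SparkSession

    // Hypothetical sketch of a Long-only back-to-back map, not the patch code.
    object PrimitiveMapSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("primitive-map-sketch")
          .getOrCreate()
        import spark.implicits._

        val numRows = 1000L  // assumption: small size, for illustration only
        val func = (l: Long) => l + 1

        // RDD[Long]: plain Scala function over primitives.
        val rddSum = spark.sparkContext.range(1, numRows).map(func).collect().sum

        // DataFrame: a column expression on the single Long column.
        val dfSum = spark.range(1, numRows).select($"id".as("l"))
          .select(($"l" + 1).as("l")).as[Long].collect().sum

        // Dataset[Long]: the same Scala function applied through the typed API.
        val dsSum = spark.range(1, numRows).as[Long].map(func).collect().sum

        // All three variants produce the same values.
        assert(rddSum == dfSum && dfSum == dsSum)
        spark.stop()
      }
    }
    ```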
    
    ```
    without fix
    OpenJDK 64-Bit Server VM 1.8.0_91-b14 on Linux 3.10.0-229.el7.x86_64
    Intel Xeon E3-12xx v2 (Ivy Bridge)
    back-to-back map:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    RDD                                           2051 / 2077         48.7          20.5       1.0X
    DataFrame                                      755 /  940        132.5           7.5       2.7X
    Dataset                                       6155 / 6680         16.2          61.6       0.3X
    
    with fix
    back-to-back map:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    RDD                                           2077 / 2259         48.1          20.8       1.0X
    DataFrame                                     3030 / 3310         33.0          30.3       0.7X
    Dataset                                       6504 / 7006         15.4          65.0       0.3X
    
    back-to-back map for primitive:          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    RDD                                           1073 / 1509         93.2          10.7       1.0X
    DataFrame                                      763 /  913        131.0           7.6       1.4X
    Dataset                                       4189 / 4312         23.9          41.9       0.3X
    ```
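    
    The derived columns can be cross-checked from the best times. The sketch
    below assumes that `numRows` is 100,000,000 (inferred from the reported
    rates, not stated in this description) and that Relative is the best time of
    the first row divided by the best time of each row. `BenchmarkColumns` and
    its members are hypothetical names, not part of Spark's benchmark utility.
    
    ```scala
    // Hypothetical helper for re-deriving the table columns from best times.
    object BenchmarkColumns {
      // Assumption: row count inferred from 48.7 M rows/s at 2051 ms.
      val numRows: Long = 100000000L

      final case class Row(name: String, bestMs: Double)

      // Throughput in millions of rows per second.
      def rate(r: Row): Double = numRows / (r.bestMs / 1000.0) / 1e6

      // Time spent per row, in nanoseconds.
      def perRowNs(r: Row): Double = r.bestMs * 1e6 / numRows

      // Speed relative to the baseline (first) row; above 1.0 means faster.
      def relative(baseline: Row, r: Row): Double = baseline.bestMs / r.bestMs

      def main(args: Array[String]): Unit = {
        // Best times from the "with fix" back-to-back map table.
        val withFix = Seq(Row("RDD", 2077), Row("DataFrame", 3030), Row("Dataset", 6504))
        withFix.foreach { r =>
          val rel = relative(withFix.head, r)
          println(f"${r.name}%-10s ${rate(r)}%6.1f M/s ${perRowNs(r)}%6.1f ns $rel%4.1fX")
        }
      }
    }
    ```
    
    With these assumptions the computed values round to the figures in the
    "with fix" table, for example 33.0 M/s, 30.3 ns and 0.7X for DataFrame.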
    
    Note that DatasetBenchmark causes a JVM crash in an aggregate test case. This
    is not related to this issue.
    I have already created a JIRA entry and submitted a pull request for that crash.
    https://issues.apache.org/jira/browse/SPARK-15704
    https://github.com/apache/spark/pull/13446
    
    
    ## How was this patch tested?
    By executing the DatasetBenchmark


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/inouehrs/spark fix_benchmark_fairness

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13459.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13459
    
----
commit ca21e673916ecc7c51f24ddfef8f748bade2e11b
Author: Hiroshi Inoue <[email protected]>
Date:   2016-05-31T07:17:07Z

    make backToBackMap of DatasetBenchmark fair
    
    In the original implementation, the DataFrame version does less work
    than the RDD and Dataset versions.

commit f1e49f35e03c1d69662273b91b67b734d95b117b
Author: Hiroshi Inoue <[email protected]>
Date:   2016-05-31T07:24:09Z

    add backToBackMapPrimitive in DatasetBenchmark
    
    The backToBackMapPrimitive is almost equivalent to the original
    backToBackMap implementation for DataFrame

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
