[GitHub] spark issue #13459: [SPARK-15726] [SQL] Make DatasetBenchmark fairer among D...

inouehrs Thu, 02 Jun 2016 08:42:52 -0700

Github user inouehrs commented on the issue:

    https://github.com/apache/spark/pull/13459
  
    @rxin Thank you for the suggestion.
    When `res.count` is used instead of `res.queryExecution.toRdd.foreach(_ => Unit)`, the execution times become much shorter, as shown below. The DataFrame performance is especially impressive.
    In this case, the overhead of the conversion to RDD is replaced by the aggregation overhead.
    When I used `res.foreach(_ => Unit)` instead of `res.queryExecution.toRdd.foreach(_ => Unit)`, the performance degraded.
    I am going to add these aggregate versions of the tests to my pull request.
    Do you have any suggestions for an action to use here instead of `count`?
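
    For reference, the three actions compared above can be sketched roughly as follows. This is a minimal illustration, not the code in DatasetBenchmark: the `ActionSketch` object, the `timeMs` helper, the row count, and the `res` query are all assumptions made up for the example, and a single timed run is far less reliable than the benchmark harness.

    ```scala
    import org.apache.spark.sql.{Row, SparkSession}

    object ActionSketch {
      // Hypothetical helper: runs `body` once and returns the elapsed milliseconds.
      def timeMs(body: => Unit): Double = {
        val start = System.nanoTime()
        body
        (System.nanoTime() - start) / 1e6
      }

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("action-sketch")
          .getOrCreate()

        // Assumed stand-in for the benchmark's `res`: one projection over a range.
        val res = spark.range(0L, 10000000L).selectExpr("id + 1 AS v")

        // Consume the internal-row RDD directly; `()` is the unit value.
        val viaToRdd = timeMs(res.queryExecution.toRdd.foreach(_ => ()))
        // Aggregate to a single count and discard the result.
        val viaCount = timeMs { res.count(); () }
        // Iterate through the public Dataset API.
        val viaForeach = timeMs(res.foreach((_: Row) => ()))

        println(f"toRdd.foreach: $viaToRdd%.1f ms, count: $viaCount%.1f ms, foreach: $viaForeach%.1f ms")
        spark.stop()
      }
    }
    ```

    The explicit `(_: Row)` parameter type is there only to keep overload resolution on `Dataset.foreach` unambiguous on newer Scala versions.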
    
    ```
    OpenJDK 64-Bit Server VM 1.8.0_91-b14 on Linux 3.10.0-229.el7.x86_64
    Intel Xeon E3-12xx v2 (Ivy Bridge)
    back-to-back map:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    RDD                                           2118 / 2300         47.2          21.2       1.0X
    DataFrame                                      172 /  280        582.3           1.7      12.3X
    Dataset                                       4895 / 4999         20.4          49.0       0.4X
    
    OpenJDK 64-Bit Server VM 1.8.0_91-b14 on Linux 3.10.0-229.el7.x86_64
    Intel Xeon E3-12xx v2 (Ivy Bridge)
    back-to-back map for primitive:          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    RDD                                            883 / 1150        113.2           8.8       1.0X
    DataFrame                                      110 /  121        905.2           1.1       8.0X
    Dataset                                       3880 / 3915         25.8          38.8       0.2X
    ```
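
    As a reading aid for the tables above, the Rate, Per Row, and Relative columns appear to be derived from the Best time. The sketch below reproduces that arithmetic under two assumptions that are not stated in the comment: each case processes 100 million rows (inferred from Best time multiplied by Rate), and Relative is the first row's Best time divided by the current row's. The `BenchmarkColumns` object and its members are hypothetical names. Because the printed Best times are rounded to whole milliseconds, recomputed values can differ slightly from the reported ones.

    ```scala
    object BenchmarkColumns {
      // Assumption: 100 million rows per case, inferred from Best time x Rate.
      val numRows: Long = 100000000L

      final case class Derived(rateMPerSec: Double, perRowNs: Double)

      // Derives the Rate (million rows/s) and Per Row (ns) columns from a Best time in ms.
      def derive(bestMs: Double): Derived = {
        val rate = numRows / (bestMs / 1000.0) / 1e6 // rows per second, in millions
        val perRow = bestMs * 1e6 / numRows          // ms converted to ns, per row
        Derived(rate, perRow)
      }

      // Speed relative to the baseline (first) row; values above 1.0 are faster.
      def relative(baselineBestMs: Double, bestMs: Double): Double =
        baselineBestMs / bestMs

      def main(args: Array[String]): Unit = {
        // Best times (ms) copied from the "back-to-back map" table.
        val cases = Seq("RDD" -> 2118.0, "DataFrame" -> 172.0, "Dataset" -> 4895.0)
        val baseline = cases.head._2
        cases.foreach { case (name, best) =>
          val d = derive(best)
          println(f"$name%-10s ${d.rateMPerSec}%8.1f M/s ${d.perRowNs}%6.1f ns ${relative(baseline, best)}%5.1fX")
        }
      }
    }
    ```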


