Github user inouehrs commented on the issue:
https://github.com/apache/spark/pull/13459
@rxin Thank you for the suggestion.
When `res.count` is used instead of `res.queryExecution.toRdd.foreach(_ =>
Unit)`, the execution times become much shorter, as shown below. The DataFrame
results are especially impressive.
In this case, the overhead of the conversion to RDD is replaced by the
aggregation overhead.
When I used `res.foreach(_ => Unit)` instead of
`res.queryExecution.toRdd.foreach(_ => Unit)`, the performance was degraded.
I am going to add these aggregate versions of the tests to my pull request.
Do you have any suggestions for an action to use here instead of `count`?
```
OpenJDK 64-Bit Server VM 1.8.0_91-b14 on Linux 3.10.0-229.el7.x86_64
Intel Xeon E3-12xx v2 (Ivy Bridge)
back-to-back map:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
RDD                                           2118 / 2300         47.2          21.2       1.0X
DataFrame                                      172 /  280        582.3           1.7      12.3X
Dataset                                       4895 / 4999         20.4          49.0       0.4X

OpenJDK 64-Bit Server VM 1.8.0_91-b14 on Linux 3.10.0-229.el7.x86_64
Intel Xeon E3-12xx v2 (Ivy Bridge)
back-to-back map for primitive:          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
RDD                                            883 / 1150        113.2           8.8       1.0X
DataFrame                                      110 /  121        905.2           1.1       8.0X
Dataset                                       3880 / 3915         25.8          38.8       0.2X
```
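For reading the table: the Rate, Per Row, and Relative columns appear to be derived from the best time of each case. The sketch below recomputes them for the first table, assuming 100 million rows per case. That row count is inferred from the reported numbers (2118 ms at 21.2 ns per row) and is not stated in this comment. The object and function names are hypothetical.

```scala
object BenchmarkColumns {
  // Assumption: each case processed 100 million rows (inferred from the table, not stated).
  val numRows: Long = 100000000L

  // Rate(M/s): millions of rows processed per second at the best time.
  def rateMPerSec(bestMs: Double): Double = numRows / (bestMs / 1000.0) / 1e6

  // Per Row(ns): best time converted to nanoseconds, divided by the row count.
  def perRowNs(bestMs: Double): Double = bestMs * 1e6 / numRows

  // Relative: speedup over the baseline case (the first row, RDD).
  def relative(baselineMs: Double, bestMs: Double): Double = baselineMs / bestMs

  def main(args: Array[String]): Unit = {
    // Best times (ms) copied from the "back-to-back map" table.
    val cases = Seq("RDD" -> 2118.0, "DataFrame" -> 172.0, "Dataset" -> 4895.0)
    val baselineMs = cases.head._2
    cases.foreach { case (name, ms) =>
      println(f"$name%-10s ${rateMPerSec(ms)}%8.1f M/s ${perRowNs(ms)}%6.1f ns/row ${relative(baselineMs, ms)}%5.1fX")
    }
  }
}
```

Recomputed values can differ from the table in the last digit, because the table prints best times rounded to whole milliseconds.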