GitHub user inouehrs opened a pull request:
https://github.com/apache/spark/pull/13459
[SPARK-15726] [SQL] Make DatasetBenchmark fairer among Dataset, DataFrame and RDD
## What changes were proposed in this pull request?
DatasetBenchmark compares the performance of RDD, DataFrame and Dataset
while running the same operations.
In the backToBackMap test case, however, the DataFrame implementation does
less work than the RDD and Dataset implementations. This test case processes
Long+String pairs, but the output of the DataFrame implementation omits the
String part, while the RDD and Dataset implementations generate Long+String
pairs as output.
This difference significantly changes the performance characteristics because
of the String manipulation and creation overheads. With the fix, RDD
outperforms DataFrame, whereas without the fix DataFrame was more than 2x
faster than RDD. The performance gap between DataFrame and Dataset also
becomes much narrower.
This issue does not affect Spark users, but it may confuse Spark developers.
```scala
// DataFrame
val df = spark.range(1, numRows).select($"id".as("l"),
  $"id".cast(StringType).as("s"))
var res = df
res = res.select($"l" + 1 as "l")
// For a fair comparison this should be:
//   res = res.select($"l" + 1 as "l", $"s")

// Dataset (shown separately; `res` here is a Dataset[Data], not the DataFrame above)
case class Data(l: Long, s: String)
val func = (d: Data) => Data(d.l + 1, d.s)
var res = df.as[Data]
res = res.map(func)
```
Additionally, I added a new test case named "back-to-back map for
primitive". It is almost equivalent to the old behavior of the DataFrame
implementation of back-to-back map.
```
without fix
OpenJDK 64-Bit Server VM 1.8.0_91-b14 on Linux 3.10.0-229.el7.x86_64
Intel Xeon E3-12xx v2 (Ivy Bridge)
back-to-back map:                 Best/Avg Time(ms)   Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------
RDD                                   2051 / 2077          48.7          20.5       1.0X
DataFrame                              755 /  940         132.5           7.5       2.7X
Dataset                               6155 / 6680          16.2          61.6       0.3X

with fix
back-to-back map:                 Best/Avg Time(ms)   Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------
RDD                                   2077 / 2259          48.1          20.8       1.0X
DataFrame                             3030 / 3310          33.0          30.3       0.7X
Dataset                               6504 / 7006          15.4          65.0       0.3X

back-to-back map for primitive:   Best/Avg Time(ms)   Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------
RDD                                   1073 / 1509          93.2          10.7       1.0X
DataFrame                              763 /  913         131.0           7.6       1.4X
Dataset                               4189 / 4312          23.9          41.9       0.3X
```
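To make the change in the Relative column easier to follow, here is a minimal sketch that recomputes it from the best times reported above. It assumes that Relative is the first row's best time divided by each row's best time; that is an assumption about how the benchmark utility derives the column, although it agrees with the numbers shown. The object name `RelativeSpeed` and the helper `relative` are hypothetical and are not part of DatasetBenchmark.

```scala
object RelativeSpeed {
  // Best times in milliseconds, copied from the "back-to-back map" results above.
  val withoutFix: Seq[(String, Double)] =
    Seq("RDD" -> 2051.0, "DataFrame" -> 755.0, "Dataset" -> 6155.0)
  val withFix: Seq[(String, Double)] =
    Seq("RDD" -> 2077.0, "DataFrame" -> 3030.0, "Dataset" -> 6504.0)

  // Assumption: Relative = (best time of the first row) / (best time of this row),
  // so values above 1.0 mean "faster than the baseline".
  def relative(rows: Seq[(String, Double)]): Seq[(String, Double)] = {
    val baselineMs = rows.head._2
    rows.map { case (name, bestMs) => name -> baselineMs / bestMs }
  }

  def main(args: Array[String]): Unit = {
    for ((label, rows) <- Seq("without fix" -> withoutFix, "with fix" -> withFix)) {
      println(label)
      relative(rows).foreach { case (name, r) => println(f"  $name%-10s $r%.1fX") }
    }
  }
}
```

Under that assumption, the DataFrame ratio drops from above 1 (faster than RDD) to below 1 (slower than RDD) once the String column is carried through, which is the reversal described in the summary.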
Note that DatasetBenchmark causes a JVM crash in an aggregate test case. The
crash is not related to this issue.
I have already created a JIRA entry and submitted a pull request for the crash:
https://issues.apache.org/jira/browse/SPARK-15704
https://github.com/apache/spark/pull/13446
## How was this patch tested?
By running DatasetBenchmark.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/inouehrs/spark fix_benchmark_fairness
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/13459.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #13459
----
commit ca21e673916ecc7c51f24ddfef8f748bade2e11b
Author: Hiroshi Inoue <[email protected]>
Date: 2016-05-31T07:17:07Z
make backToBackMap of DatasetBenchmark fair
In the original implementation, DataFrame version processes less work
than RDD and Dataset versions.
commit f1e49f35e03c1d69662273b91b67b734d95b117b
Author: Hiroshi Inoue <[email protected]>
Date: 2016-05-31T07:24:09Z
add backToBackMapPrimitive in DatasetBenchmark
The backToBackMapPrimitive is almost equivalent with the original
backToBackMap implementation for DataFrame
----