Hiroshi Inoue created SPARK-15726:
-------------------------------------
Summary: Make DatasetBenchmark more fairer among Dataset,
DataFrame and RDD
Key: SPARK-15726
URL: https://issues.apache.org/jira/browse/SPARK-15726
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.1.0
Reporter: Hiroshi Inoue
Priority: Minor
DatasetBenchmark compares the performances of RDD, DataFrame and Dataset while
running the same operations.
In backToBackMap test case, however, the only DataFrame implementation executes
less work compared to RDD or Dataset implementations. This test case processes
Long+String pairs, but the output from the DataFrame implementation does not
include String part while RDD or Dataset generates Long+String pairs as output.
This difference significantly changes the performance characteristics due to
the String manipulation and creation overheads. After the fix RDD outperforms
DataFrame, while DataFrame was more than 2x faster than RDD without the fix.
Also, the performance gap between DataFrame and Dataset becomes much narrower.
Of course, this issue does not affect Spark users, but it may confuse Spark
developers.
{quote}
*// DataFrame*
val df = spark.range(1, numRows).select($"id".as("l"),
$"id".cast(StringType).as("s"))
var res = df
{color:blue}res = res.select($"l" + 1 as "l"){color}
// this should be {color:red}res = res.select($"l" + 1 as "l", $"s"){color} for
fair comparison
*// Dataset*
case class Data(l: Long, s: String)
val func = (d: Data) => Data(d.l + 1, d.s)
var res = df.as\[Data\]
res = res.map(func)
{quote}
Additionally, I added a new test case named "back-to-back map for primitive".
This is almost equivalent with the old behavior of the DataFrame implementation
of back-to-back map.
{quote}
without fix
OpenJDK 64-Bit Server VM 1.8.0_91-b14 on Linux 3.10.0-229.el7.x86_64
Intel Xeon E3-12xx v2 (Ivy Bridge)
back-to-back map: Best/Avg Time(ms) Rate(M/s) Per
Row(ns) Relative
------------------------------------------------------------------------------------------------
RDD 2051 / 2077 48.7
20.5 1.0X
DataFrame 755 / 940 132.5
7.5 2.7X
Dataset 6155 / 6680 16.2
61.6 0.3X
with fix
back-to-back map: Best/Avg Time(ms) Rate(M/s) Per
Row(ns) Relative
------------------------------------------------------------------------------------------------
RDD 2077 / 2259 48.1
20.8 1.0X
DataFrame 3030 / 3310 33.0
30.3 0.7X
Dataset 6504 / 7006 15.4
65.0 0.3X
back-to-back map for primitive: Best/Avg Time(ms) Rate(M/s) Per
Row(ns) Relative
------------------------------------------------------------------------------------------------
RDD 1073 / 1509 93.2
10.7 1.0X
DataFrame 763 / 913 131.0
7.6 1.4X
Dataset 4189 / 4312 23.9
41.9 0.3X
{quote}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]