This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 6115a5e [SPARK-27327][SQL] New JSON benchmarks: functions, Dataset[String]
6115a5e is described below
commit 6115a5e1a096e4a2c97ea9b9b18848a782a05b25
Author: Maxim Gekk <[email protected]>
AuthorDate: Mon Apr 1 08:33:16 2019 +0900
[SPARK-27327][SQL] New JSON benchmarks: functions, Dataset[String]
## What changes were proposed in this pull request?
Added new benchmarks for:
1. JSON functions: `from_json`, `json_tuple` and `get_json_object`
2. Parsing `Dataset[String]` with JSON records
3. Comparing plain splitting of the input text into lines against schema
inference and per-line parsing, with and without an explicit encoding.
Also, the existing benchmarks were refactored to use the `NoOp` datasource,
which eliminates the overhead of triggers like `.filter((_: Row) => true).count()`.
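
A minimal sketch of the `NoOp` pattern outside the benchmark harness, assuming a Spark 3.x build on the classpath and a local master. `NoopSinkSketch` is a hypothetical name, and `mode("overwrite")` is an addition that is not part of this commit; it is there on the assumption that released Spark 3.x versions reject the default save mode for this source.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object NoopSinkSketch {
  // Forces full evaluation of `ds` and discards every produced row,
  // so no extra filter or aggregation is needed to trigger the job.
  def run(ds: Dataset[_]): Unit =
    ds.write.format("noop").mode("overwrite").save()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")            // assumption: local run for illustration
      .appName("noop-sink-sketch")
      .getOrCreate()
    import spark.implicits._

    // Same shape of input as the new benchmarks: one tiny JSON record per row.
    val in = spark.range(0, 1000, 1, 1).map(_ => """{"a":1}""")
    val parsed = spark.read.json(in)

    run(parsed)
    spark.stop()
  }
}
```

Writing to the no-op sink materializes every row, whereas a trailing `count()` can add work of its own (an aggregation step) on top of the parsing being measured.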
## How was this patch tested?
By running `JSONBenchmark` locally.
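
When reading the results file changed by this commit, the derived columns look reproducible from the row count and the best time alone. A small sketch under that assumption follows; the relationship is inferred from the published figures, not taken from the `Benchmark` class, and `BenchmarkColumns` and `derive` are hypothetical names.

```scala
object BenchmarkColumns {
  // Derived columns for one benchmark case.
  final case class Derived(rateMPerSec: Double, perRowNs: Double, relative: Double)

  // rows: number of rows processed per iteration
  // bestMs: best time of this case, in milliseconds
  // baselineBestMs: best time of the first case in the same table
  def derive(rows: Long, bestMs: Double, baselineBestMs: Double): Derived = {
    val perRowNs = bestMs * 1e6 / rows     // ms converted to ns, per row
    val rate = rows / (bestMs * 1e3)       // millions of rows per second
    val relative = baselineBestMs / bestMs // speed relative to the first case
    Derived(rate, perRowNs, relative)
  }

  def main(args: Array[String]): Unit = {
    // Figures from the "JSON functions" table: 10 * 1000 * 1000 rows.
    val rows = 10L * 1000 * 1000
    val fromJson = derive(rows, bestMs = 12924, baselineBestMs = 939)
    println(f"from_json: ${fromJson.rateMPerSec}%.1f M/s, " +
      f"${fromJson.perRowNs}%.1f ns/row, ${fromJson.relative}%.1fX")
  }
}
```

For `from_json` this gives about 1292.4 ns per row and a ratio of about 0.07, consistent with the 1292.4 and 0.1X shown in the table.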
Closes #24252 from MaxGekk/json-benchmark-func.
Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
---
sql/core/benchmarks/JSONBenchmark-results.txt | 124 ++++++++++-------
.../execution/datasources/json/JsonBenchmark.scala | 155 +++++++++++++++++----
2 files changed, 206 insertions(+), 73 deletions(-)
diff --git a/sql/core/benchmarks/JSONBenchmark-results.txt b/sql/core/benchmarks/JSONBenchmark-results.txt
index f16e60c..2b784c3 100644
--- a/sql/core/benchmarks/JSONBenchmark-results.txt
+++ b/sql/core/benchmarks/JSONBenchmark-results.txt
@@ -3,53 +3,81 @@ Benchmark for performance of JSON parsing
================================================================================================
Preparing data for benchmarking ...
-Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Linux 3.16.0-31-generic
-Intel(R) Xeon(R) CPU @ 2.50GHz
-JSON schema inferring:                  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------
-No encoding                                 80821 / 82526          1.2         808.2       1.0X
-UTF-8 is set                              129478 / 130381          0.8        1294.8       0.6X
-
-Preparing data for benchmarking ...
-Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Linux 3.16.0-31-generic
-Intel(R) Xeon(R) CPU @ 2.50GHz
-count a short column:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------
-No encoding                                 16804 / 16948          6.0         168.0       1.0X
-UTF-8 is set                                16648 / 16757          6.0         166.5       1.0X
-
-Preparing data for benchmarking ...
-Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Linux 3.16.0-31-generic
-Intel(R) Xeon(R) CPU @ 2.50GHz
-count a wide column:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------
-No encoding                                 30949 / 31058          0.3        3094.9       1.0X
-UTF-8 is set                                30629 / 33896          0.3        3062.9       1.0X
-
-Preparing data for benchmarking ...
-Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Linux 3.16.0-31-generic
-Intel(R) Xeon(R) CPU @ 2.50GHz
-select wide row:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------
-No encoding                               123050 / 124199          0.0      246099.8       1.0X
-UTF-8 is set                              139306 / 142569          0.0      278612.7       0.9X
-
-Preparing data for benchmarking ...
-Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Linux 3.16.0-31-generic
-Intel(R) Xeon(R) CPU @ 2.50GHz
-Select a subset of 10 columns:          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------
-Select 10 columns + count()                 19539 / 19896          0.5        1953.9       1.0X
-Select 1 column + count()                   16412 / 16445          0.6        1641.2       1.2X
-
-Preparing data for benchmarking ...
-Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Linux 3.16.0-31-generic
-Intel(R) Xeon(R) CPU @ 2.50GHz
-creation of JSON parser per line:       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------
-Short column without encoding                 9576 / 9612          1.0         957.6       1.0X
-Short column with UTF-8                     13555 / 13698          0.7        1355.5       0.7X
-Wide column without encoding              174761 / 175665          0.1       17476.1       0.1X
-Wide column with UTF-8                    203219 / 205151          0.0       20321.9       0.0X
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4
+Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
+JSON schema inferring:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+No encoding                                      51280          51722         420         2.0         512.8       1.0X
+UTF-8 is set                                     75009          77276        1963         1.3         750.1       0.7X
+Preparing data for benchmarking ...
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4
+Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
+count a short column:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+No encoding                                      39675          39738          83         2.5         396.7       1.0X
+UTF-8 is set                                     62755          64399        1436         1.6         627.5       0.6X
+
+Preparing data for benchmarking ...
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4
+Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
+count a wide column:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+No encoding                                      56429          56468          65         0.2        5642.9       1.0X
+UTF-8 is set                                     81078          81454         374         0.1        8107.8       0.7X
+
+Preparing data for benchmarking ...
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4
+Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
+select wide row:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+No encoding                                      95329          95557         265         0.0      190658.2       1.0X
+UTF-8 is set                                    102827         102967         166         0.0      205654.2       0.9X
+
+Preparing data for benchmarking ...
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4
+Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
+Select a subset of 10 columns:           Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+Select 10 columns                                14102          14136          52         0.7        1410.2       1.0X
+Select 1 column                                  17487          17537          51         0.6        1748.7       0.8X
+
+Preparing data for benchmarking ...
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4
+Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
+creation of JSON parser per line:        Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+Short column without encoding                     6013           6066          70         1.7         601.3       1.0X
+Short column with UTF-8                           8031           8079          45         1.2         803.1       0.7X
+Wide column without encoding                    107093         108539         NaN         0.1       10709.3       0.1X
+Wide column with UTF-8                          130983         132518        1346         0.1       13098.3       0.0X
+
+Preparing data for benchmarking ...
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4
+Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
+JSON functions:                          Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+Text read                                          939            950          11        10.6          93.9       1.0X
+from_json                                        12924          12944          26         0.8        1292.4       0.1X
+json_tuple                                       15312          15771         432         0.7        1531.2       0.1X
+get_json_object                                  13049          13475         714         0.8        1304.9       0.1X
+
+Preparing data for benchmarking ...
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4
+Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
+Dataset of json strings:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+Text read                                         4556           4630         108        11.0          91.1       1.0X
+schema inferring                                 23624          24338         626         2.1         472.5       0.2X
+parsing                                          22342          22420          81         2.2         446.8       0.2X
+
+Preparing data for benchmarking ...
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4
+Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
+Json files in the per-line mode:         Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+Text read                                         7537           7556          26         6.6         150.7       1.0X
+Schema inferring                                 27875          28306         499         1.8         557.5       0.3X
+Parsing without charset                          26030          26083          67         1.9         520.6       0.3X
+Parsing with UTF-8                               37115          37480         392         1.3         742.3       0.2X
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonBenchmark.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonBenchmark.scala
index 25f7620..f9e867b 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonBenchmark.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonBenchmark.scala
@@ -17,9 +17,9 @@
package org.apache.spark.sql.execution.datasources.json
import org.apache.spark.benchmark.Benchmark
-import org.apache.spark.sql.Row
+import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.execution.benchmark.SqlBasedBenchmark
-import org.apache.spark.sql.functions.lit
+import org.apache.spark.sql.functions.{from_json, get_json_object, json_tuple, lit}
import org.apache.spark.sql.types._
/**
@@ -39,12 +39,16 @@ import org.apache.spark.sql.types._
object JSONBenchmark extends SqlBasedBenchmark {
import spark.implicits._
- def prepareDataInfo(benchmark: Benchmark): Unit = {
+ private def prepareDataInfo(benchmark: Benchmark): Unit = {
// scalastyle:off println
benchmark.out.println("Preparing data for benchmarking ...")
// scalastyle:on println
}
+ private def run(ds: Dataset[_]): Unit = {
+ ds.write.format("noop").save()
+ }
+
def schemaInferring(rowsNum: Int, numIters: Int): Unit = {
val benchmark = new Benchmark("JSON schema inferring", rowsNum, output = output)
@@ -202,20 +206,21 @@ object JSONBenchmark extends SqlBasedBenchmark {
val fields = Seq.tabulate(colsNum)(i => StructField(s"col$i", IntegerType))
val schema = StructType(fields)
- val columnNames = schema.fieldNames
spark.range(rowsNum)
.select(Seq.tabulate(colsNum)(i => lit(i).as(s"col$i")): _*)
.write
.json(path.getAbsolutePath)
- val ds = spark.read.schema(schema).json(path.getAbsolutePath)
+ val in = spark.read.schema(schema).json(path.getAbsolutePath)
- benchmark.addCase(s"Select $colsNum columns + count()", numIters) { _ =>
- ds.select("*").filter((_: Row) => true).count()
+ benchmark.addCase(s"Select $colsNum columns", numIters) { _ =>
+ val ds = in.select("*")
+ run(ds)
}
- benchmark.addCase(s"Select 1 column + count()", numIters) { _ =>
- ds.select($"col1").filter((_: Row) => true).count()
+ benchmark.addCase(s"Select 1 column", numIters) { _ =>
+ val ds = in.select($"col1")
+ run(ds)
}
benchmark.run()
@@ -235,37 +240,134 @@ object JSONBenchmark extends SqlBasedBenchmark {
val wideSchema = writeWideColumn(wideColumnPath, rowsNum)
benchmark.addCase("Short column without encoding", numIters) { _ =>
- spark.read
- .schema(shortSchema)
- .json(shortColumnPath)
- .filter((_: Row) => true)
- .count()
+ val ds = spark.read.schema(shortSchema).json(shortColumnPath)
+ run(ds)
}
benchmark.addCase("Short column with UTF-8", numIters) { _ =>
- spark.read
+ val ds = spark.read
.option("encoding", "UTF-8")
.schema(shortSchema)
.json(shortColumnPath)
- .filter((_: Row) => true)
- .count()
+ run(ds)
}
benchmark.addCase("Wide column without encoding", numIters) { _ =>
- spark.read
- .schema(wideSchema)
- .json(wideColumnPath)
- .filter((_: Row) => true)
- .count()
+ val ds = spark.read.schema(wideSchema).json(wideColumnPath)
+ run(ds)
}
benchmark.addCase("Wide column with UTF-8", numIters) { _ =>
- spark.read
+ val ds = spark.read
.option("encoding", "UTF-8")
.schema(wideSchema)
.json(wideColumnPath)
- .filter((_: Row) => true)
- .count()
+ run(ds)
+ }
+
+ benchmark.run()
+ }
+ }
+
+ def jsonFunctions(rows: Int, iters: Int): Unit = {
+ val benchmark = new Benchmark("JSON functions", rows, output = output)
+
+ prepareDataInfo(benchmark)
+
+ val in = spark.range(0, rows, 1, 1).map(_ => """{"a":1}""")
+
+ benchmark.addCase("Text read", iters) { _ =>
+ run(in)
+ }
+
+ benchmark.addCase("from_json", iters) { _ =>
+ val schema = new StructType().add("a", IntegerType)
+ val from_json_ds = in.select(from_json('value, schema))
+ run(from_json_ds)
+ }
+
+ benchmark.addCase("json_tuple", iters) { _ =>
+ val json_tuple_ds = in.select(json_tuple($"value", "a"))
+ run(json_tuple_ds)
+ }
+
+ benchmark.addCase("get_json_object", iters) { _ =>
+ val get_json_object_ds = in.select(get_json_object($"value", "$.a"))
+ run(get_json_object_ds)
+ }
+
+ benchmark.run()
+ }
+
+ def jsonInDS(rows: Int, iters: Int): Unit = {
+ val benchmark = new Benchmark("Dataset of json strings", rows, output = output)
+
+ prepareDataInfo(benchmark)
+
+ val in = spark.range(0, rows, 1, 1).map(_ => """{"a":1}""")
+
+ benchmark.addCase("Text read", iters) { _ =>
+ run(in)
+ }
+
+ benchmark.addCase("schema inferring", iters) { _ =>
+ spark.read.json(in).schema
+ }
+
+ benchmark.addCase("parsing", iters) { _ =>
+ val schema = new StructType().add("a", IntegerType)
+ val ds = spark.read
+ .schema(schema)
+ .json(in)
+ run(ds)
+ }
+
+ benchmark.run()
+ }
+
+ def jsonInFile(rows: Int, iters: Int): Unit = {
+ val benchmark = new Benchmark("Json files in the per-line mode", rows, output = output)
+
+ withTempPath { path =>
+ prepareDataInfo(benchmark)
+
+ spark.sparkContext.range(0, rows, 1, 1)
+ .toDF("a")
+ .write
+ .json(path.getAbsolutePath)
+
+ benchmark.addCase("Text read", iters) { _ =>
+ val ds = spark.read
+ .format("text")
+ .load(path.getAbsolutePath)
+ run(ds)
+ }
+
+ benchmark.addCase("Schema inferring", iters) { _ =>
+ val ds = spark.read
+ .option("multiLine", false)
+ .json(path.getAbsolutePath)
+ ds.schema
+ }
+
+ val schema = new StructType().add("a", LongType)
+
+ benchmark.addCase("Parsing without charset", iters) { _ =>
+ val ds = spark.read
+ .schema(schema)
+ .option("multiLine", false)
+ .json(path.getAbsolutePath)
+ run(ds)
+ }
+
+ benchmark.addCase("Parsing with UTF-8", iters) { _ =>
+ val ds = spark.read
+ .schema(schema)
+ .option("multiLine", false)
+ .option("charset", "UTF-8")
+ .json(path.getAbsolutePath)
+
+ run(ds)
}
benchmark.run()
@@ -281,6 +383,9 @@ object JSONBenchmark extends SqlBasedBenchmark {
countWideRow(500 * 1000, numIters)
selectSubsetOfColumns(10 * 1000 * 1000, numIters)
jsonParserCreation(10 * 1000 * 1000, numIters)
+ jsonFunctions(10 * 1000 * 1000, numIters)
+ jsonInDS(50 * 1000 * 1000, numIters)
+ jsonInFile(50 * 1000 * 1000, numIters)
}
}
}