[spark] branch master updated: [SPARK-27327][SQL] New JSON benchmarks: functions, Dataset[String]

gurwls223 Sun, 31 Mar 2019 16:34:22 -0700

This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/master by this push:
     new 6115a5e  [SPARK-27327][SQL] New JSON benchmarks: functions, Dataset[String]
6115a5e is described below

commit 6115a5e1a096e4a2c97ea9b9b18848a782a05b25
Author: Maxim Gekk <[email protected]>
AuthorDate: Mon Apr 1 08:33:16 2019 +0900

    [SPARK-27327][SQL] New JSON benchmarks: functions, Dataset[String]
    
    ## What changes were proposed in this pull request?
    
    Added new benchmarks for:
    1. JSON functions: `from_json`, `json_tuple` and `get_json_object`
    2. Parsing `Dataset[String]` with JSON records
    3. JSON files in per-line mode: comparing plain splitting of the input text by lines against schema inferring and per-line parsing with and without an explicit encoding.
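
A minimal sketch of the three functions from item 1, applied to a one-row `Dataset[String]` shaped like the `{"a":1}` records the benchmark generates. The object name `JsonFunctionsSketch`, its helper names, and the local `SparkSession` settings are assumptions made for illustration and are not part of the patch.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.{col, from_json, get_json_object, json_tuple}
import org.apache.spark.sql.types.{IntegerType, StructType}

// Hypothetical helper object; all names here are illustrative only.
object JsonFunctionsSketch {
  private val schema = new StructType().add("a", IntegerType)

  // from_json needs a schema and yields a struct column; "parsed.a" is an Int.
  def viaFromJson(in: Dataset[String]): DataFrame =
    in.select(from_json(col("value"), schema).as("parsed")).select(col("parsed.a"))

  // json_tuple extracts top-level fields by name and returns them as strings.
  def viaJsonTuple(in: Dataset[String]): DataFrame =
    in.select(json_tuple(col("value"), "a"))

  // get_json_object extracts one value by a JSONPath-like expression as a string.
  def viaGetJsonObject(in: Dataset[String]): DataFrame =
    in.select(get_json_object(col("value"), "$.a").as("a"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("json-functions-sketch")
      .getOrCreate()
    import spark.implicits._
    try {
      // A Dataset[String] has a single column named "value".
      val in = Seq("""{"a":1}""").toDS()
      viaFromJson(in).show()
      viaJsonTuple(in).show()
      viaGetJsonObject(in).show()
    } finally {
      spark.stop()
    }
  }
}
```

Only `from_json` takes a schema and produces typed values; the other two return strings, which is worth keeping in mind when reading the three rows of the "JSON functions" results side by side.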
    
    Also, existing benchmarks were refactored to use the `NoOp` datasource to eliminate the overhead of triggers like `.filter((_: Row) => true).count()`.
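
The two ways of forcing evaluation mentioned above can be sketched side by side. This is an illustration under assumptions, not the patch itself: the object name `NoopTriggerSketch` and the local session settings are hypothetical, and the sketch sets the `overwrite` save mode explicitly because some Spark releases may reject the default save mode for the `noop` sink, whereas the patch calls `ds.write.format("noop").save()` directly.

```scala
import org.apache.spark.sql.{Dataset, Row, SparkSession}

// Hypothetical helper object; names are illustrative only.
object NoopTriggerSketch {
  // Old-style trigger: a typed no-op filter followed by count().
  // The lambda forces every row to be converted to a Row object.
  def runWithCount(ds: Dataset[Row]): Long =
    ds.filter((_: Row) => true).count()

  // New-style trigger: write every row to a sink that discards it.
  // mode("overwrite") is an assumption added for newer Spark versions.
  def runWithNoop(ds: Dataset[_]): Unit =
    ds.write.format("noop").mode("overwrite").save()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("noop-trigger-sketch")
      .getOrCreate()
    try {
      val df = spark.range(0, 1000).toDF("id")
      val counted = runWithCount(df)
      runWithNoop(df)
      println(s"rows counted by the old trigger: $counted")
    } finally {
      spark.stop()
    }
  }
}
```

Because the sink consumes rows without a filter or an aggregation on top, the time measured for each case reflects the read and parse path more directly.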
    
    ## How was this patch tested?
    
    By running `JSONBenchmark` locally.
    
    Closes #24252 from MaxGekk/json-benchmark-func.
    
    Authored-by: Maxim Gekk <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
---
 sql/core/benchmarks/JSONBenchmark-results.txt      | 124 ++++++++++-------
 .../execution/datasources/json/JsonBenchmark.scala | 155 +++++++++++++++++----
 2 files changed, 206 insertions(+), 73 deletions(-)

diff --git a/sql/core/benchmarks/JSONBenchmark-results.txt b/sql/core/benchmarks/JSONBenchmark-results.txt
index f16e60c..2b784c3 100644
--- a/sql/core/benchmarks/JSONBenchmark-results.txt
+++ b/sql/core/benchmarks/JSONBenchmark-results.txt
@@ -3,53 +3,81 @@ Benchmark for performance of JSON parsing
 ================================================================================================
 
 Preparing data for benchmarking ...
-Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Linux 3.16.0-31-generic
-Intel(R) Xeon(R) CPU @ 2.50GHz
-JSON schema inferring:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------
-No encoding                                 80821 / 82526          1.2         808.2       1.0X
-UTF-8 is set                              129478 / 130381          0.8        1294.8       0.6X
-
-Preparing data for benchmarking ...
-Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Linux 3.16.0-31-generic
-Intel(R) Xeon(R) CPU @ 2.50GHz
-count a short column:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------
-No encoding                                 16804 / 16948          6.0         168.0       1.0X
-UTF-8 is set                                16648 / 16757          6.0         166.5       1.0X
-
-Preparing data for benchmarking ...
-Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Linux 3.16.0-31-generic
-Intel(R) Xeon(R) CPU @ 2.50GHz
-count a wide column:                     Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------
-No encoding                                 30949 / 31058          0.3        3094.9       1.0X
-UTF-8 is set                                30629 / 33896          0.3        3062.9       1.0X
-
-Preparing data for benchmarking ...
-Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Linux 3.16.0-31-generic
-Intel(R) Xeon(R) CPU @ 2.50GHz
-select wide row:                         Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------
-No encoding                               123050 / 124199          0.0      246099.8       1.0X
-UTF-8 is set                              139306 / 142569          0.0      278612.7       0.9X
-
-Preparing data for benchmarking ...
-Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Linux 3.16.0-31-generic
-Intel(R) Xeon(R) CPU @ 2.50GHz
-Select a subset of 10 columns:           Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------
-Select 10 columns + count()                 19539 / 19896          0.5        1953.9       1.0X
-Select 1 column + count()                   16412 / 16445          0.6        1641.2       1.2X
-
-Preparing data for benchmarking ...
-Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Linux 3.16.0-31-generic
-Intel(R) Xeon(R) CPU @ 2.50GHz
-creation of JSON parser per line:        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------
-Short column without encoding                 9576 / 9612          1.0         957.6       1.0X
-Short column with UTF-8                     13555 / 13698          0.7        1355.5       0.7X
-Wide column without encoding              174761 / 175665          0.1       17476.1       0.1X
-Wide column with UTF-8                    203219 / 205151          0.0       20321.9       0.0X
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4
+Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
+JSON schema inferring:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+No encoding                                       51280          51722         420          2.0         512.8       1.0X
+UTF-8 is set                                      75009          77276        1963          1.3         750.1       0.7X
 
+Preparing data for benchmarking ...
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4
+Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
+count a short column:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+No encoding                                       39675          39738          83          2.5         396.7       1.0X
+UTF-8 is set                                      62755          64399        1436          1.6         627.5       0.6X
+
+Preparing data for benchmarking ...
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4
+Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
+count a wide column:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+No encoding                                       56429          56468          65          0.2        5642.9       1.0X
+UTF-8 is set                                      81078          81454         374          0.1        8107.8       0.7X
+
+Preparing data for benchmarking ...
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4
+Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
+select wide row:                          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+No encoding                                       95329          95557         265          0.0      190658.2       1.0X
+UTF-8 is set                                     102827         102967         166          0.0      205654.2       0.9X
+
+Preparing data for benchmarking ...
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4
+Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
+Select a subset of 10 columns:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+Select 10 columns                                 14102          14136          52          0.7        1410.2       1.0X
+Select 1 column                                   17487          17537          51          0.6        1748.7       0.8X
+
+Preparing data for benchmarking ...
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4
+Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
+creation of JSON parser per line:         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+Short column without encoding                      6013           6066          70          1.7         601.3       1.0X
+Short column with UTF-8                            8031           8079          45          1.2         803.1       0.7X
+Wide column without encoding                     107093         108539         NaN          0.1       10709.3       0.1X
+Wide column with UTF-8                           130983         132518        1346          0.1       13098.3       0.0X
+
+Preparing data for benchmarking ...
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4
+Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
+JSON functions:                           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+Text read                                           939            950          11         10.6          93.9       1.0X
+from_json                                         12924          12944          26          0.8        1292.4       0.1X
+json_tuple                                        15312          15771         432          0.7        1531.2       0.1X
+get_json_object                                   13049          13475         714          0.8        1304.9       0.1X
+
+Preparing data for benchmarking ...
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4
+Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
+Dataset of json strings:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+Text read                                          4556           4630         108         11.0          91.1       1.0X
+schema inferring                                  23624          24338         626          2.1         472.5       0.2X
+parsing                                           22342          22420          81          2.2         446.8       0.2X
+
+Preparing data for benchmarking ...
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4
+Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
+Json files in the per-line mode:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+Text read                                          7537           7556          26          6.6         150.7       1.0X
+Schema inferring                                  27875          28306         499          1.8         557.5       0.3X
+Parsing without charset                           26030          26083          67          1.9         520.6       0.3X
+Parsing with UTF-8                                37115          37480         392          1.3         742.3       0.2X
 
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonBenchmark.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonBenchmark.scala
index 25f7620..f9e867b 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonBenchmark.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonBenchmark.scala
@@ -17,9 +17,9 @@
 package org.apache.spark.sql.execution.datasources.json
 
 import org.apache.spark.benchmark.Benchmark
-import org.apache.spark.sql.Row
+import org.apache.spark.sql.{Dataset, Row}
 import org.apache.spark.sql.execution.benchmark.SqlBasedBenchmark
-import org.apache.spark.sql.functions.lit
+import org.apache.spark.sql.functions.{from_json, get_json_object, json_tuple, lit}
 import org.apache.spark.sql.types._
 
 /**
@@ -39,12 +39,16 @@ import org.apache.spark.sql.types._
 object JSONBenchmark extends SqlBasedBenchmark {
   import spark.implicits._
 
-  def prepareDataInfo(benchmark: Benchmark): Unit = {
+  private def prepareDataInfo(benchmark: Benchmark): Unit = {
     // scalastyle:off println
     benchmark.out.println("Preparing data for benchmarking ...")
     // scalastyle:on println
   }
 
+  private def run(ds: Dataset[_]): Unit = {
+    ds.write.format("noop").save()
+  }
+
   def schemaInferring(rowsNum: Int, numIters: Int): Unit = {
     val benchmark = new Benchmark("JSON schema inferring", rowsNum, output = output)
 
@@ -202,20 +206,21 @@ object JSONBenchmark extends SqlBasedBenchmark {
 
       val fields = Seq.tabulate(colsNum)(i => StructField(s"col$i", IntegerType))
       val schema = StructType(fields)
-      val columnNames = schema.fieldNames
 
       spark.range(rowsNum)
         .select(Seq.tabulate(colsNum)(i => lit(i).as(s"col$i")): _*)
         .write
         .json(path.getAbsolutePath)
 
-      val ds = spark.read.schema(schema).json(path.getAbsolutePath)
+      val in = spark.read.schema(schema).json(path.getAbsolutePath)
 
-      benchmark.addCase(s"Select $colsNum columns + count()", numIters) { _ =>
-        ds.select("*").filter((_: Row) => true).count()
+      benchmark.addCase(s"Select $colsNum columns", numIters) { _ =>
+        val ds = in.select("*")
+        run(ds)
       }
-      benchmark.addCase(s"Select 1 column + count()", numIters) { _ =>
-        ds.select($"col1").filter((_: Row) => true).count()
+      benchmark.addCase(s"Select 1 column", numIters) { _ =>
+        val ds = in.select($"col1")
+        run(ds)
       }
 
       benchmark.run()
@@ -235,37 +240,134 @@ object JSONBenchmark extends SqlBasedBenchmark {
       val wideSchema = writeWideColumn(wideColumnPath, rowsNum)
 
       benchmark.addCase("Short column without encoding", numIters) { _ =>
-        spark.read
-          .schema(shortSchema)
-          .json(shortColumnPath)
-          .filter((_: Row) => true)
-          .count()
+        val ds = spark.read.schema(shortSchema).json(shortColumnPath)
+        run(ds)
       }
 
       benchmark.addCase("Short column with UTF-8", numIters) { _ =>
-        spark.read
+        val ds = spark.read
           .option("encoding", "UTF-8")
           .schema(shortSchema)
           .json(shortColumnPath)
-          .filter((_: Row) => true)
-          .count()
+        run(ds)
       }
 
       benchmark.addCase("Wide column without encoding", numIters) { _ =>
-        spark.read
-          .schema(wideSchema)
-          .json(wideColumnPath)
-          .filter((_: Row) => true)
-          .count()
+        val ds = spark.read.schema(wideSchema).json(wideColumnPath)
+        run(ds)
       }
 
       benchmark.addCase("Wide column with UTF-8", numIters) { _ =>
-        spark.read
+        val ds = spark.read
           .option("encoding", "UTF-8")
           .schema(wideSchema)
           .json(wideColumnPath)
-          .filter((_: Row) => true)
-          .count()
+        run(ds)
+      }
+
+      benchmark.run()
+    }
+  }
+
+  def jsonFunctions(rows: Int, iters: Int): Unit = {
+    val benchmark = new Benchmark("JSON functions", rows, output = output)
+
+    prepareDataInfo(benchmark)
+
+    val in = spark.range(0, rows, 1, 1).map(_ => """{"a":1}""")
+
+    benchmark.addCase("Text read", iters) { _ =>
+      run(in)
+    }
+
+    benchmark.addCase("from_json", iters) { _ =>
+      val schema = new StructType().add("a", IntegerType)
+      val from_json_ds = in.select(from_json('value, schema))
+      run(from_json_ds)
+    }
+
+    benchmark.addCase("json_tuple", iters) { _ =>
+      val json_tuple_ds = in.select(json_tuple($"value", "a"))
+      run(json_tuple_ds)
+    }
+
+    benchmark.addCase("get_json_object", iters) { _ =>
+      val get_json_object_ds = in.select(get_json_object($"value", "$.a"))
+      run(get_json_object_ds)
+    }
+
+    benchmark.run()
+  }
+
+  def jsonInDS(rows: Int, iters: Int): Unit = {
+    val benchmark = new Benchmark("Dataset of json strings", rows, output = output)
+
+    prepareDataInfo(benchmark)
+
+    val in = spark.range(0, rows, 1, 1).map(_ => """{"a":1}""")
+
+    benchmark.addCase("Text read", iters) { _ =>
+      run(in)
+    }
+
+    benchmark.addCase("schema inferring", iters) { _ =>
+      spark.read.json(in).schema
+    }
+
+    benchmark.addCase("parsing", iters) { _ =>
+      val schema = new StructType().add("a", IntegerType)
+      val ds = spark.read
+        .schema(schema)
+        .json(in)
+      run(ds)
+    }
+
+    benchmark.run()
+  }
+
+  def jsonInFile(rows: Int, iters: Int): Unit = {
+    val benchmark = new Benchmark("Json files in the per-line mode", rows, output = output)
+
+    withTempPath { path =>
+      prepareDataInfo(benchmark)
+
+      spark.sparkContext.range(0, rows, 1, 1)
+        .toDF("a")
+        .write
+        .json(path.getAbsolutePath)
+
+      benchmark.addCase("Text read", iters) { _ =>
+        val ds = spark.read
+          .format("text")
+          .load(path.getAbsolutePath)
+        run(ds)
+      }
+
+      benchmark.addCase("Schema inferring", iters) { _ =>
+        val ds = spark.read
+          .option("multiLine", false)
+          .json(path.getAbsolutePath)
+        ds.schema
+      }
+
+      val schema = new StructType().add("a", LongType)
+
+      benchmark.addCase("Parsing without charset", iters) { _ =>
+        val ds = spark.read
+          .schema(schema)
+          .option("multiLine", false)
+          .json(path.getAbsolutePath)
+        run(ds)
+      }
+
+      benchmark.addCase("Parsing with UTF-8", iters) { _ =>
+        val ds = spark.read
+          .schema(schema)
+          .option("multiLine", false)
+          .option("charset", "UTF-8")
+          .json(path.getAbsolutePath)
+
+        run(ds)
       }
 
       benchmark.run()
@@ -281,6 +383,9 @@ object JSONBenchmark extends SqlBasedBenchmark {
       countWideRow(500 * 1000, numIters)
       selectSubsetOfColumns(10 * 1000 * 1000, numIters)
       jsonParserCreation(10 * 1000 * 1000, numIters)
+      jsonFunctions(10 * 1000 * 1000, numIters)
+      jsonInDS(50 * 1000 * 1000, numIters)
+      jsonInFile(50 * 1000 * 1000, numIters)
     }
   }
 }


