bersprockets commented on a change in pull request #32969:
URL: https://github.com/apache/spark/pull/32969#discussion_r654733861
##########
File path: external/avro/src/test/scala/org/apache/spark/sql/execution/benchmark/AvroWriteBenchmark.scala
##########
@@ -31,7 +36,34 @@ package org.apache.spark.sql.execution.benchmark
* }}}
*/
object AvroWriteBenchmark extends DataSourceWriteBenchmark {
+  private def wideColumnsBenchmark: Unit = {
+    import spark.implicits._
+
+    withTempPath { dir =>
+      withTempTable("t1") {
+        val width = 1000
+        val values = 500000
+        val files = 20
+        val selectExpr = (1 to width).map(i => s"value as c$i")
+        // repartition to ensure we will write multiple files
+        val df = spark.range(values)
+          .map(_ => Random.nextInt).selectExpr(selectExpr: _*).repartition(files)
+          .persist(StorageLevel.DISK_ONLY)
+        // cache the data to ensure we are not benchmarking range or repartition
+        df.filter("(c1*c2) = 12").collect
+        df.createOrReplaceTempView("t1")
+        val benchmark = new Benchmark(s"Write wide rows into $files files", values)
+        benchmark.addCase("Write wide rows") { _ =>
+          spark.sql("SELECT * FROM t1").
+            write.format("avro").save(s"${dir.getCanonicalPath}/${Random.nextLong.abs}")
+        }
+        benchmark.run()
Review comment:
This is not quite working. Results for this benchmark get printed to stdout, but they don't show up in AvroWriteBenchmark-results.txt.
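If I'm reading the harness right, results only reach the `*-results.txt` file through the `output: Option[OutputStream]` field that `BenchmarkBase` opens in its `main`, and `org.apache.spark.benchmark.Benchmark` defaults its own `output` parameter to `None`, so a `Benchmark` constructed without it prints to stdout only. A minimal sketch of the change I'd expect to fix this (assuming that constructor shape):

```scala
// Pass the suite's `output` stream (inherited from BenchmarkBase) so the
// timings are written to AvroWriteBenchmark-results.txt as well as stdout.
val benchmark = new Benchmark(
  s"Write wide rows into $files files",
  values,
  output = output)
```

Note that the `runBenchmark("...") { ... }` helper in `BenchmarkBase` only writes the section header to the results file; the per-case timings come from `Benchmark` itself, so the constructor argument is needed either way.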