bersprockets commented on a change in pull request #32969:
URL: https://github.com/apache/spark/pull/32969#discussion_r655794745
##########
File path: external/avro/src/test/scala/org/apache/spark/sql/execution/benchmark/AvroWriteBenchmark.scala
##########
@@ -31,7 +36,34 @@ package org.apache.spark.sql.execution.benchmark
* }}}
*/
object AvroWriteBenchmark extends DataSourceWriteBenchmark {
+  private def wideColumnsBenchmark: Unit = {
+    import spark.implicits._
+
+    withTempPath { dir =>
+      withTempTable("t1") {
+        val width = 1000
+        val values = 500000
+        val files = 20
+        val selectExpr = (1 to width).map(i => s"value as c$i")
+        // repartition to ensure we will write multiple files
+        val df = spark.range(values)
+          .map(_ => Random.nextInt).selectExpr(selectExpr: _*).repartition(files)
+          .persist(StorageLevel.DISK_ONLY)
+        // cache the data to ensure we are not benchmarking range or repartition
+        df.filter("(c1*c2) = 12").collect
Review comment:
   I think I can replace this with `df.noop()`. I will check.
   I was using `collect` to force the evaluation (and caching) of df, but I didn't want to actually collect 500K wide rows, so I filtered away all of df's rows first.
   The wacky filter expression `(c1*c2) = 12` was there to ensure no push-down into the file format's libraries (so that df really does read all rows). It's not actually needed in the case of Avro.
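   For reference, a minimal sketch of the noop-based alternative (assuming Spark's built-in `noop` write format is usable from this benchmark; `df.noop()` would just be sugar for it):

       // Forces full evaluation (and DISK_ONLY caching) of df without collecting
       // 500K wide rows to the driver: the noop sink consumes every row and discards it.
       df.write.format("noop").mode("overwrite").save()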