Re: [PR] [SPARK-54698][SQL] Support hashing for all data types for array set like operations [spark]

via GitHub Sat, 13 Dec 2025 05:48:45 -0800


Kimahriman commented on PR #53468:
URL: https://github.com/apache/spark/pull/53468#issuecomment-3649449843


   I created a simple benchmark to test:
   ```scala
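   import org.apache.spark.benchmark.Benchmark
   import org.apache.spark.sql.execution.benchmark.SqlBasedBenchmark
   import org.apache.spark.sql.functions.{array_union, lit}
   
   object ArraySetLikeBenchmark extends SqlBasedBenchmark {
     private val N = 1000L
     private val arrayElements = 100000
   
     override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
       val benchmark = new Benchmark("Array Set Like", N, output = output)
   
       // 100,000 two-element int arrays, used as an array<array<int>> literal
       val arr = (1 to arrayElements).map(x => Array(x, x)).toArray
       benchmark.addCase("array_union", 1) { _ =>
         // Union the literal with itself on each of the N rows; the noop
         // sink discards the rows, so only the computation is timed
         spark.range(N)
           .select(array_union(lit(arr), lit(arr)).alias("arr"))
           .write
           .format("noop")
           .mode("append")
           .save()
       }
       benchmark.run()
     }
   }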
   ```
   
   Before:
   ```
   [info] Array Set Like:                           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   [info] ------------------------------------------------------------------------------------------------------------------------
   [info] array_distinct                                    56198          56198           0          0.0    56197860.3       1.0X
   ```
   
   After:
   ```
   [info] Array Set Like:                           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   [info] ------------------------------------------------------------------------------------------------------------------------
   [info] array_distinct                                     3113           3113           0          0.0     3112680.3       1.0X
   ```
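   
   For a quick read of the two tables, the speedup can be derived from the reported best times. This is only a sketch of the arithmetic: the `Speedup` object and its value names are illustrative and not part of the PR, and the two numbers are copied from the "Before" and "After" rows above. It works out to roughly 18x.
   
   ```scala
   object Speedup {
     // Best Time(ms) values copied from the "Before" and "After" tables
     val beforeMs = 56198.0
     val afterMs = 3113.0
   
     // How many times faster the "After" run is than the "Before" run
     val speedup: Double = beforeMs / afterMs
   
     def main(args: Array[String]): Unit =
       println(f"speedup: $speedup%.1fx")
   }
   ```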


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

