Re: [PR] [SPARK-46840][SQL][COLLATION] New Benchmarking Suite [spark]

via GitHub Thu, 21 Mar 2024 14:31:37 -0700


dbatomic commented on code in PR #45453:
URL: https://github.com/apache/spark/pull/45453#discussion_r1534709781



##########
sql/core/benchmarks/CollationBenchmark-jdk21-results.txt:
##########
@@ -0,0 +1,26 @@
+OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Linux 6.5.0-1016-azure
+AMD EPYC 7763 64-Core Processor
+filter df column with collation:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+-----------------------------------------------------------------------------------------------------------------------------------
+filter df column with collation - UNICODE_CI                    40             65          25          0.0     2001199.6       1.0X
+filter df column with collation - UNICODE                       19             28           7          0.0      958487.5       2.1X
+filter df column with collation - UTF8_BINARY_LCASE             15             18           4          0.0      773536.9       2.6X
+filter df column with collation - UTF8_BINARY                   14             16           4          0.0      683145.3       2.9X
+
+OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Linux 6.5.0-1016-azure
+AMD EPYC 7763 64-Core Processor
+collation unit benchmarks:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+equalsFunction - UTF8_BINARY                          1              1           0          1.9         520.0       1.0X
+collator.compare - UTF8_BINARY                        1              1           0          1.1         922.8       0.6X
+hashFunction - UTF8_BINARY                            3              3           0          0.3        3149.9       0.2X
+equalsFunction - UTF8_BINARY_LCASE                   77             79           5          0.0       77352.5       0.0X

Review Comment:
   If I am reading this correctly, UTF8_BINARY is ~100x faster than anything
else in the unit benchmarks (e.g. 520.0 ns vs. 77352.5 ns per row for
equalsFunction against UTF8_BINARY_LCASE), which is expected due to the extra
allocation + memory copy. We have some work to do to get the other collations
within a 2-3x factor of UTF8_BINARY.
   
   Btw, I also ran your benchmarks on my machine. Everything looks valid.
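
The ratios discussed in this comment can be checked directly from the `Per Row(ns)` column of the unit benchmarks quoted in the diff. The following is only a minimal sketch in Python, not part of the PR: the dictionary and the `slowdown` helper are hypothetical names introduced for illustration, and the numbers are copied from the jdk21 results file.

```python
# Per Row(ns) values copied from the "collation unit benchmarks" table
# in CollationBenchmark-jdk21-results.txt (as quoted in this review).
unit_ns = {
    "equalsFunction - UTF8_BINARY": 520.0,
    "collator.compare - UTF8_BINARY": 922.8,
    "hashFunction - UTF8_BINARY": 3149.9,
    "equalsFunction - UTF8_BINARY_LCASE": 77352.5,
}


def slowdown(per_row_ns, baseline):
    """Return how many times slower each case is than the baseline case.

    Hypothetical helper for illustration; it is not part of Spark.
    """
    base = per_row_ns[baseline]
    return {name: ns / base for name, ns in per_row_ns.items()}


ratios = slowdown(unit_ns, "equalsFunction - UTF8_BINARY")
for name, ratio in ratios.items():
    # The table's "Relative" column is the inverse of this ratio,
    # rounded to one decimal place.
    print(f"{name}: {ratio:.1f}x slower, Relative = {1 / ratio:.1f}X")
```

With these inputs, `equalsFunction` for UTF8_BINARY_LCASE comes out at roughly 149x the per-row cost of UTF8_BINARY, which is in line with the "~100x" estimate and shows how far the number is from the 2-3x target.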



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

