nikolamand-db commented on code in PR #45816:
URL: https://github.com/apache/spark/pull/45816#discussion_r1553730550
##########
sql/core/benchmarks/CollationBenchmark-results.txt:
##########
@@ -2,26 +2,26 @@ OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 6.5.0-1016-azure
AMD EPYC 7763 64-Core Processor
collation unit benchmarks - equalsFunction:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
--------------------------------------------------------------------------------------------------------------------------
-UTF8_BINARY_LCASE                                   34122         34152         42        0.0     341224.2      1.0X
-UNICODE                                              4520          4522          2        0.0      45201.8      7.5X
-UTF8_BINARY                                          4524          4526          2        0.0      45243.0      7.5X
-UNICODE_CI                                          52706         52711          7        0.0     527056.1      0.6X
+UTF8_BINARY_LCASE                                    8006          8022         24        0.0      80056.6      1.0X
+UNICODE                                              3151          3152          3        0.0      31505.3      2.5X
+UTF8_BINARY                                          3152          3164         17        0.0      31517.9      2.5X
+UNICODE_CI                                          54159         54258        140        0.0     541591.6      0.1X
OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 6.5.0-1016-azure
AMD EPYC 7763 64-Core Processor
collation unit benchmarks - compareFunction:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
---------------------------------------------------------------------------------------------------------------------------
-UTF8_BINARY_LCASE                                    33467         33474         10        0.0     334671.7      1.0X
-UNICODE                                              51168         51168          1        0.0     511677.4      0.7X
-UTF8_BINARY                                           5561          5593         45        0.0      55610.9      6.0X
-UNICODE_CI                                           51929         51955         36        0.0     519291.8      0.6X
+UTF8_BINARY_LCASE                                    11169         11175          8        0.0     111691.2      1.0X
+UNICODE                                              49021         49052         45        0.0     490209.1      0.2X
+UTF8_BINARY                                           6415          6415          0        0.0      64145.8      1.7X
+UNICODE_CI                                           50373         50385         18        0.0     503725.4      0.2X
OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 6.5.0-1016-azure
AMD EPYC 7763 64-Core Processor
collation unit benchmarks - hashFunction:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------------------------------
-UTF8_BINARY_LCASE                                 22079         22083          5        0.0     220786.7      1.0X
-UNICODE                                          177636        177709        103        0.0    1776363.9      0.1X
-UTF8_BINARY                                       11954         11956          3        0.0     119536.7      1.8X
-UNICODE_CI                                       158014        158038         35        0.0    1580135.7      0.1X
+UTF8_BINARY_LCASE                                 24485         24506         30        0.0     244846.2      1.0X
Review Comment:
It may be better to skip the hash optimization for now. Hashing blocks of
data relies on the internal mixing functions:
https://github.com/apache/spark/blob/383bb4af004253e1eb84d3f3e58347e0d7670f66/common/unsafe/src/main/java/org/apache/spark/unsafe/hash/Murmur3_x86_32.java#L74-L75
To lowercase char by char, however, we would have to supply the data as a
stream generated on the fly, and the internal hash implementation does not
support that yet.
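
To illustrate the idea, here is a rough sketch of what a streaming variant
could look like. It is a standalone Python rendering of the standard
32-bit Murmur3 mixing steps, not Spark's `Murmur3_x86_32`, and it may differ
from Spark in details such as how trailing bytes are handled. All names
(`StreamingHasher`, `hash_lowercase_streaming`, the seed of 42) are
hypothetical and chosen only for this example.

```python
MASK = 0xFFFFFFFF
C1 = 0xCC9E2D51
C2 = 0x1B873593


def _rotl32(x: int, r: int) -> int:
    return ((x << r) | (x >> (32 - r))) & MASK


def mix_k1(k1: int) -> int:
    k1 = (k1 * C1) & MASK
    k1 = _rotl32(k1, 15)
    return (k1 * C2) & MASK


def mix_h1(h1: int, k1: int) -> int:
    h1 ^= k1
    h1 = _rotl32(h1, 13)
    return (h1 * 5 + 0xE6546B64) & MASK


def fmix(h1: int, length: int) -> int:
    h1 ^= length
    h1 ^= h1 >> 16
    h1 = (h1 * 0x85EBCA6B) & MASK
    h1 ^= h1 >> 13
    h1 = (h1 * 0xC2B2AE35) & MASK
    h1 ^= h1 >> 16
    return h1


def hash_bytes(data: bytes, seed: int = 42) -> int:
    """Block path: needs the whole (already lowercased) buffer up front."""
    h1 = seed & MASK
    aligned = len(data) - len(data) % 4
    for i in range(0, aligned, 4):
        h1 = mix_h1(h1, mix_k1(int.from_bytes(data[i:i + 4], "little")))
    if aligned < len(data):
        h1 ^= mix_k1(int.from_bytes(data[aligned:], "little"))
    return fmix(h1, len(data))


class StreamingHasher:
    """Streaming path: accepts one byte at a time, mixes every full word."""

    def __init__(self, seed: int = 42) -> None:
        self.h1 = seed & MASK
        self.pending = 0      # partially filled little-endian word
        self.pending_len = 0  # number of bytes currently in `pending`
        self.total = 0        # total bytes consumed so far

    def update(self, byte: int) -> None:
        self.pending |= byte << (8 * self.pending_len)
        self.pending_len += 1
        self.total += 1
        if self.pending_len == 4:
            self.h1 = mix_h1(self.h1, mix_k1(self.pending))
            self.pending = 0
            self.pending_len = 0

    def digest(self) -> int:
        h1 = self.h1
        if self.pending_len:
            h1 ^= mix_k1(self.pending)
        return fmix(h1, self.total)


def hash_lowercase_streaming(s: str, seed: int = 42) -> int:
    """Lowercase char by char and hash without building a lowercase copy."""
    hasher = StreamingHasher(seed)
    for ch in s:
        for b in ch.lower().encode("utf-8"):
            hasher.update(b)
    return hasher.digest()
```

For simple inputs such as ASCII text, the streaming path gives the same
result as lowercasing the whole string first and hashing it as a block, so
`hash_lowercase_streaming("Hello World")` matches
`hash_bytes(b"hello world")`. Context-sensitive case mappings can make
char-by-char lowercasing differ from whole-string lowercasing, so the
equivalence is not guaranteed for every Unicode input.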
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]