GideonPotok commented on code in PR #46526:
URL: https://github.com/apache/spark/pull/46526#discussion_r1600041876


##########
sql/core/benchmarks/CollationBenchmark-jdk21-results.txt:
##########


Review Comment:
   0.  Note, by the way that because we are relying on supportsBinaryEquality, 
this is about preserving not only the performance for UTF8_BINARY, but also 
that of UNICODE
   1. @uros-db check again. I believe the benchmarks are slightly more 
realistic now. In that for each string there are 3-6 that are equal by 
collation. EG:
   
   ```
    collation unit benchmarks - mode - 30105 elements:  Best Time(ms)   Avg 
Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    
---------------------------------------------------------------------------------------------------------------------------------
    UTF8_BINARY_LCASE - mode - 30105 elements                      6            
  6           0          5.1         195.6       1.0X
    UNICODE - mode - 30105 elements                                3            
  3           0         11.6          86.0       2.3X
    UTF8_BINARY - mode - 30105 elements                            3            
  3           0         11.6          85.9       2.3X
    UNICODE_CI - mode - 30105 elements                            12            
 12           1          2.6         382.9       0.5X
    ```
    
    2. Still a slowdown (though there is more work, so how would we expect any 
different). I willl run these new benchmarks on the other approaches and 
assuming this one is best, we can get this one ready for final stages of 
cleanup and review...
    
    3.  I will leave one benchmark for mode rather then having three for 
different input sizes...  that was just a temporary setup.
    
    5. An idea would be for the case of UTF8_BINARY and UNICODE to go through 
the `lower` operation first. This would be a better way to check that, as the 
design doc instructs: "Performance regression for case insensitive collation 
must be no worse than using upper() or ilike() explicitly" . Let me know 
whether to change the benchmark accordingly. There will probably still be a 
performance degradation, but it would at least be a fairer comparison.
    
    
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to