Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]

via GitHub Tue, 14 May 2024 09:15:12 -0700


GideonPotok commented on code in PR #46526:
URL: https://github.com/apache/spark/pull/46526#discussion_r1600041876

##########
sql/core/benchmarks/CollationBenchmark-jdk21-results.txt:
##########

Review Comment:
0. Note, by the way that because we are relying on supportsBinaryEquality,
this is about preserving not only the performance for UTF8_BINARY, but also
that of UNICODE
1. @uros-db check again. I believe the benchmarks are slightly more
realistic now. In that for each string there are 3-6 that are equal by
collation. EG:

```
collation unit benchmarks - mode - 30105 elements: Best Time(ms) Avg
Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative

---------------------------------------------------------------------------------------------------------------------------------
UTF8_BINARY_LCASE - mode - 30105 elements 6
6 0 5.1 195.6 1.0X
UNICODE - mode - 30105 elements 3
3 0 11.6 86.0 2.3X
UTF8_BINARY - mode - 30105 elements 3
3 0 11.6 85.9 2.3X
UNICODE_CI - mode - 30105 elements 12
12 1 2.6 382.9 0.5X
```

2. Still a slowdown (though there is more work, so how would we expect any
different). I willl run these new benchmarks on the other approaches and
assuming this one is best, we can get this one ready for final stages of
cleanup and review...

3. I will leave one benchmark for mode rather then having three for
different input sizes... that was just a temporary setup.

5. An idea would be for the case of UTF8_BINARY and UNICODE to go through
the `lower` operation first. This would be a better way to check that, as the
design doc instructs: "Performance regression for case insensitive collation
must be no worse than using upper() or ilike() explicitly" . Let me know
whether to change the benchmark accordingly. There will probably still be a
performance degradation, but it would at least be a fairer comparison.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [spark]

Reply via email to