pitrou opened a new pull request, #45217:
URL: https://github.com/apache/arrow/pull/45217
### Rationale for this change
The Rank implementation currently mixes ties/duplicates detection and rank
computation in a single function `CreateRankings`. This makes it poorly
reusable for other Rank-like functions such as the Percentile Rank function
proposed in GH-45190.
### What changes are included in this PR?
Split duplicates detection into a dedicated function that sets a marker bit
in the sort-indices array (the array is private to the Rank implementation, so
mutating it is safe).
The rank computation itself (`CreateRankings`) becomes simpler. It also no
longer needs to read the input values, and is therefore type-agnostic.
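As a rough illustration of the split described above (not the code in this PR), the sketch below marks duplicates in one type-aware pass and then computes "min"-style ranks from the marked indices alone. The names `kDuplicateMask`, `MarkDuplicates` and `RankMin`, the use of `std::vector`, and the choice of the top bit as the marker are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: the top bit of each sort index is free to use as the marker.
constexpr uint64_t kDuplicateMask = uint64_t{1} << 63;

// Phase 1 (type-aware): set the marker bit on every sort index whose value
// equals the value of its predecessor in sorted order.
template <typename T>
void MarkDuplicates(const std::vector<T>& values,
                    std::vector<uint64_t>& sort_indices) {
  for (std::size_t i = 1; i < sort_indices.size(); ++i) {
    // The marker bit must not already be in use by a real index.
    assert((sort_indices[i] & kDuplicateMask) == 0);
    const uint64_t prev = sort_indices[i - 1] & ~kDuplicateMask;
    if (values[sort_indices[i]] == values[prev]) {
      sort_indices[i] |= kDuplicateMask;
    }
  }
}

// Phase 2 (type-agnostic): compute 1-based ranks where tied values share the
// lowest rank of their group. Only the marked indices are read, never the
// input values.
std::vector<uint64_t> RankMin(const std::vector<uint64_t>& marked_indices) {
  std::vector<uint64_t> ranks(marked_indices.size());
  uint64_t rank = 0;
  for (std::size_t i = 0; i < marked_indices.size(); ++i) {
    const uint64_t raw = marked_indices[i];
    if ((raw & kDuplicateMask) == 0) {
      rank = i + 1;  // first element of a new group of equal values
    }
    ranks[raw & ~kDuplicateMask] = rank;
  }
  return ranks;
}
```

In this sketch only `MarkDuplicates` is templated on the value type, so a second Rank-like function could reuse the marked indices without another per-type instantiation.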
This yields a code size reduction (around 45kB saved on the author's
machine):
* before:
```console
$ size /build/build-release/relwithdebinfo/libarrow.so
    text   data     bss      dec     hex filename
26072218 353832 2567985 28994035 1ba69f3 /build/build-release/relwithdebinfo/libarrow.so
```
* after:
```console
$ size /build/build-release/relwithdebinfo/libarrow.so
    text   data     bss      dec     hex filename
26028198 353832 2567985 28950015 1b9bdff /build/build-release/relwithdebinfo/libarrow.so
```
Rank benchmark results are mostly neutral, with slight improvements on some
benchmarks and slight regressions on others, especially on all-nulls input.
### Are these changes tested?
Yes, by existing tests.
### Are there any user-facing changes?
No.