phongn opened a new pull request, #13320:
URL: https://github.com/apache/trafficserver/pull/13320
## Summary
Adds SIMD-accelerated implementations of ASCII lowercasing
(`ts::ascii::tolower_copy` / `tolower_inplace`) and base64 encode/decode
(`ats_base64_encode` / `ats_base64_decode`), built on Google Highway and
selected at runtime by CPU capability. Both are gated behind a new build option,
`ENABLE_HIGHWAY_DISPATCH`, that **defaults to OFF**; without it the scalar paths
are used and existing builds see no behavior change.
This combines the previously separate to_lower and base64 SIMD efforts into
one series and folds in the review fixes applied on our internal branch.
## ASCII to_lower
- New `ts::ascii::tolower_copy(dst, src, n)` / `tolower_inplace(buf, n)` in
`include/tscore/ink_ascii_tolower.h`. Folds `A`–`Z` → `a`–`z`; all other bytes
(including 0x80–0xFF) pass through unchanged; no UTF-8 folding; in-place use
(`dst == src`) is supported.
- Highway runtime-dispatched kernel in `ink_ascii_tolower_dispatch.cc` (one
source compiled for SSE4/AVX2/AVX-512/NEON via `foreach_target`; the best
target for the live CPU is chosen once and cached). When the option is off, a
portable scalar loop is used.
- Migrated the hand-rolled tolower loops to the new API at the relevant call
sites — URL cache-key fast path (`URL.cc`), `HPACK.cc`, `QPACK.cc`,
`UrlRewrite.cc` — with behavioral tests added alongside each (`test_URL`,
`test_RemapRules`, `test_HpackIndexingTable`).
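
For reviewers, the folding contract described above can be pinned down with a minimal scalar sketch. This is not the code from this PR: the `sketch` namespace and the function bodies are illustrative assumptions, and only the semantics (fold `A`-`Z`, pass every other byte through, allow `dst == src`) come from the description.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

namespace sketch {
// Hypothetical scalar reference, not the implementation in this PR.
// Folds 'A'..'Z' to 'a'..'z' and copies every other byte unchanged,
// including 0x80..0xFF. Safe when dst == src because each byte is read
// before it is written.
inline void tolower_copy(char *dst, const char *src, std::size_t n)
{
  for (std::size_t i = 0; i < n; ++i) {
    unsigned char c = static_cast<unsigned char>(src[i]);
    // (c - 'A') wraps to a large value for bytes below 'A', so a single
    // unsigned comparison covers the whole 'A'..'Z' range.
    if (static_cast<unsigned char>(c - 'A') < 26) {
      c |= 0x20; // set the ASCII lowercase bit
    }
    dst[i] = static_cast<char>(c);
  }
}

// In-place variant expressed through the copy variant.
inline void tolower_inplace(char *buf, std::size_t n)
{
  tolower_copy(buf, buf, n);
}
} // namespace sketch
```

A reference of this shape is also what a parity test can compare a SIMD kernel against, since it has no alignment or length restrictions.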
## base64
- Highway runtime-dispatched SIMD encode/decode
(`ink_base64_dispatch.{cc,h}`), using the vectorized base64 algorithms from
simdutf re-expressed in Highway (Muła/Lemire; aqrit's combined
standard/URL-safe classifier).
- Scalar primitives extracted to `ink_base64_scalar.h`, shared by the scalar
path and the SIMD path's tail so the two cannot drift. Decode fuses validation
into the SIMD loop and hands the remainder (including truncation at the first
non-alphabet byte) to the scalar tail, so SIMD output is byte-for-byte
identical to scalar — including in-place decode and mixed standard/URL-safe
alphabets.
- **Fixes a latent out-of-bounds read** in scalar `ats_base64_decode`: when
the decodable prefix length was not a multiple of four, the old loop ran one
iteration past the prefix (over-reading the input, and reading `inBuffer[-2]`).
Decode now processes only whole 4-character groups plus an explicit
2/3-character tail. The decoded length and bytes are unchanged for every
well-defined input.
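
The decode structure described above (find the decodable prefix, process whole 4-character groups, then handle an explicit 2- or 3-character tail) might look roughly like the sketch below. It is an illustration under stated assumptions, not `ats_base64_decode`: the `sketch::decode` signature, the use of `std::string`, and the choice to drop a lone trailing character are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

namespace sketch {
// Returns the 6-bit value of c, or -1 if c is in neither the standard
// nor the URL-safe alphabet ('=' is treated as non-alphabet).
inline int b64_value(unsigned char c)
{
  if (c >= 'A' && c <= 'Z') return c - 'A';
  if (c >= 'a' && c <= 'z') return c - 'a' + 26;
  if (c >= '0' && c <= '9') return c - '0' + 52;
  if (c == '+' || c == '-') return 62;
  if (c == '/' || c == '_') return 63;
  return -1;
}

// Hypothetical decoder; the name and signature are assumptions.
inline std::string decode(const std::string &in)
{
  // Step 1: the decodable prefix ends at the first non-alphabet byte.
  std::size_t len = 0;
  while (len < in.size() && b64_value(static_cast<unsigned char>(in[len])) >= 0) {
    ++len;
  }
  auto val = [&in](std::size_t k) {
    return static_cast<std::uint32_t>(b64_value(static_cast<unsigned char>(in[k])));
  };

  std::string out;
  std::size_t i = 0;
  // Step 2: whole 4-character groups only; never reads past the prefix.
  for (; i + 4 <= len; i += 4) {
    std::uint32_t v = (val(i) << 18) | (val(i + 1) << 12) | (val(i + 2) << 6) | val(i + 3);
    out.push_back(static_cast<char>((v >> 16) & 0xFF));
    out.push_back(static_cast<char>((v >> 8) & 0xFF));
    out.push_back(static_cast<char>(v & 0xFF));
  }
  // Step 3: explicit tail. Two characters yield one byte, three yield two.
  // Assumption: a single leftover character carries only 6 bits and is dropped.
  std::size_t rem = len - i;
  if (rem >= 2) {
    std::uint32_t v = (val(i) << 18) | (val(i + 1) << 12);
    out.push_back(static_cast<char>((v >> 16) & 0xFF));
    if (rem == 3) {
      v |= val(i + 2) << 6;
      out.push_back(static_cast<char>((v >> 8) & 0xFF));
    }
  }
  return out;
}
} // namespace sketch
```

Because the group loop is bounded by `i + 4 <= len`, a prefix whose length is not a multiple of four can never cause an extra iteration, which is the class of over-read the fix targets.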
## Build / wiring
- `ENABLE_HIGHWAY_DISPATCH` (default OFF) gates the SIMD paths via
`TS_HAS_HIGHWAY_DISPATCH`; `EXTERNAL_HWY` selects an external Highway over the
vendored copy.
- New `branch-highway` CMake preset builds with the option on, turning the
unit tests into real SIMD-vs-scalar parity checks.
- `NOTICE` updated to attribute simdutf and Google Highway.
## Performance
Measured on an Intel Xeon Gold 6338 (Ice Lake-SP, AVX-512), Release build
(`-O3`), Highway dispatching to its AVX-512 target. Baselines are the scalar
paths these replace. The public APIs stay on the scalar path below the SIMD
thresholds (base64 encode: 24 bytes, base64 decode: 32 characters) to avoid
dispatch overhead on tiny inputs, which is why the smallest sizes show little
gain; at 8 bytes the tolower kernel is slower than the scalar loop (0.7×).
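
The size gating can be sketched as follows. Only the two threshold values come from this description; the constant names, the `Path` enum, the `simd_available` flag, and the assumption that an input exactly at the threshold takes the SIMD path are illustrative.

```cpp
#include <cassert>
#include <cstddef>

namespace sketch {
// Threshold values quoted in the PR description; the names are hypothetical.
constexpr std::size_t kEncodeSimdMinBytes = 24;
constexpr std::size_t kDecodeSimdMinChars = 32;

enum class Path { Scalar, Simd };

// Small inputs stay scalar so they never pay the dispatch cost.
// Assumption: inputs exactly at the threshold are routed to SIMD.
constexpr Path choose_encode_path(std::size_t input_bytes, bool simd_available)
{
  return (simd_available && input_bytes >= kEncodeSimdMinBytes) ? Path::Simd : Path::Scalar;
}

constexpr Path choose_decode_path(std::size_t input_chars, bool simd_available)
{
  return (simd_available && input_chars >= kDecodeSimdMinChars) ? Path::Simd : Path::Scalar;
}

// With the build option off, every size falls back to scalar.
static_assert(choose_encode_path(4096, false) == Path::Scalar, "scalar-only build");
} // namespace sketch
```

Keeping the decision in one small function makes it easy to benchmark alternative cut-over points without touching the kernels.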
**ASCII tolower** — ns per call, vs the byte-at-a-time `ink_tolower` loop:
| bytes | scalar (ns) | Highway (ns) | speedup |
|------:|------------:|-------------:|--------:|
| 8 | 5.9 | 7.9 | 0.7× |
| 16 | 12.6 | 5.0 | 2.5× |
| 32 | 21.8 | 4.5 | 4.9× |
| 64 | 41.2 | 5.6 | 7.3× |
| 256 | 175 | 12.0 | 14.6× |
| 1024 | 676 | 32.5 | 20.8× |
**base64 decode** — GB/s on input chars:
| chars | scalar | Highway | speedup |
|------:|-------:|--------:|--------:|
| 64 | 1.1 | 5.2 | 4.9× |
| 128 | 1.1 | 6.8 | 6.4× |
| 512 | 1.1 | 6.9 | 6.4× |
| 64 KB | 1.2 | 8.0 | 6.9× |
**base64 encode** — GB/s on input bytes:
| bytes | scalar | Highway | speedup |
|------:|-------:|--------:|--------:|
| 96 | 1.2 | 3.6 | 3.1× |
| 200 | 1.4 | 5.7 | 4.2× |
| 512 | 1.4 | 6.9 | 5.1× |
| 64 KB | 1.3 | 7.5 | 6.0× |
## Testing
- Unit tests for both features (`test_ink_ascii_tolower.cc`,
`test_ink_base64.cc`) compare the public path against an independent scalar
reference across sizes, alphabets, truncation, in-place, and buffer-bound
cases; with `ENABLE_HIGHWAY_DISPATCH=ON` they become SIMD-vs-scalar parity
tests.
- `tests/fuzzing/fuzz_base64.cc`: libFuzzer target that decodes untrusted
input and cross-checks both paths under sanitizers.
- `tools/benchmark/benchmark_ascii_tolower.cc` reproduces the tolower
numbers above.
- Builds and unit tests pass with the option both ON and OFF.
## Notes
- Depends on the vendored Google Highway copy (#13228).
- CI currently exercises only the scalar paths; a follow-up should add a job
that configures the `branch-highway` preset to get parity coverage of the SIMD
kernels.
🤖 Generated with [Claude Code](https://claude.com/claude-code)