phongn opened a new pull request, #13320:
URL: https://github.com/apache/trafficserver/pull/13320

   ## Summary
   
   Adds SIMD-accelerated implementations of ASCII lowercasing 
(`ts::ascii::tolower_copy` / `tolower_inplace`) and base64 encode/decode 
(`ats_base64_encode` / `ats_base64_decode`), built on Google Highway and 
selected at runtime by CPU capability. Both are gated behind a new build option 
that **defaults to OFF** — without it the scalar paths are used and there is no 
behavior change to existing builds.
   
   This combines the previously separate to_lower and base64 SIMD efforts into 
one series and folds in the review fixes applied on our internal branch.
   
   ## ASCII to_lower
   
   - New `ts::ascii::tolower_copy(dst, src, n)` / `tolower_inplace(buf, n)` in 
`include/tscore/ink_ascii_tolower.h`. Folds `A`–`Z` to `a`–`z`; all other bytes 
(including 0x80–0xFF) pass through unchanged, so there is no UTF-8 case 
folding. In-place use (`dst == src`) is supported.
   - Highway runtime-dispatched kernel in `ink_ascii_tolower_dispatch.cc` (one 
source compiled for SSE4/AVX2/AVX-512/NEON via `foreach_target`; the best 
target for the live CPU is chosen once and cached). When the option is off, a 
portable scalar loop is used.
   - Migrated the hand-rolled tolower loops to the new API at the relevant call 
sites — URL cache-key fast path (`URL.cc`), `HPACK.cc`, `QPACK.cc`, 
`UrlRewrite.cc` — with behavioral tests added alongside each (`test_URL`, 
`test_RemapRules`, `test_HpackIndexingTable`).
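
The folding contract described above can be sketched as a plain scalar loop. This is an illustration only, not the code in this PR: the `sketch_*` names are hypothetical, and the range check via unsigned wraparound is one common way to write it, assumed here rather than taken from `ink_ascii_tolower.h`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical scalar reference, NOT the PR's implementation.
// Folds 'A'..'Z' to 'a'..'z' and copies every other byte unchanged,
// including 0x80..0xFF. Safe when dst == src because each byte is read
// before the same index is written.
inline void sketch_tolower_copy(char *dst, const char *src, std::size_t n)
{
  assert(n == 0 || (dst != nullptr && src != nullptr));
  for (std::size_t i = 0; i < n; ++i) {
    unsigned char c = static_cast<unsigned char>(src[i]);
    // (c - 'A') wraps to a large value for bytes below 'A', so a single
    // unsigned compare covers exactly the 26 uppercase letters.
    if (static_cast<unsigned char>(c - 'A') < 26) {
      c |= 0x20;
    }
    dst[i] = static_cast<char>(c);
  }
}

// In-place form: same loop with dst == src.
inline void sketch_tolower_inplace(char *buf, std::size_t n)
{
  sketch_tolower_copy(buf, buf, n);
}

// Convenience wrapper for illustration only.
inline std::string sketch_tolower(std::string s)
{
  if (!s.empty()) {
    sketch_tolower_inplace(&s[0], s.size());
  }
  return s;
}
```

With this sketch, `sketch_tolower("Content-TYPE")` yields `"content-type"`, while neighbours of the letter range such as `@` and `[` and any byte with the high bit set are left as they were.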
   
   ## base64
   
   - Highway runtime-dispatched SIMD encode/decode 
(`ink_base64_dispatch.{cc,h}`), using the vectorized base64 algorithms from 
simdutf re-expressed in Highway (Muła/Lemire; aqrit's combined 
standard/URL-safe classifier).
   - Scalar primitives extracted to `ink_base64_scalar.h`, shared by the scalar 
path and the SIMD path's tail so the two cannot drift. Decode fuses validation 
into the SIMD loop and hands the remainder (including truncation at the first 
non-alphabet byte) to the scalar tail, so SIMD output is byte-for-byte 
identical to scalar — including in-place decode and mixed standard/URL-safe 
alphabets.
   - **Fixes a latent out-of-bounds read** in scalar `ats_base64_decode`: when 
the decodable prefix length was not a multiple of four, the old loop ran one 
iteration past the prefix (over-reading the input, and reading `inBuffer[-2]`). 
Decode now processes only whole 4-character groups plus an explicit 
2/3-character tail. The decoded length and bytes are unchanged for every 
well-defined input.
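
The "whole 4-character groups plus an explicit 2/3-character tail" shape can be sketched as below. This is a simplified illustration under stated assumptions, not the code in `ink_base64_scalar.h`: the `sketch_*` names are hypothetical, padding (`=`) is treated like any other non-alphabet byte, and a single leftover character is assumed to produce no output.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: 6-bit value for the standard or URL-safe alphabet,
// or -1 for any other byte.
inline int sketch_b64_value(unsigned char c)
{
  if (c >= 'A' && c <= 'Z') return c - 'A';
  if (c >= 'a' && c <= 'z') return c - 'a' + 26;
  if (c >= '0' && c <= '9') return c - '0' + 52;
  if (c == '+' || c == '-') return 62;
  if (c == '/' || c == '_') return 63;
  return -1;
}

// Decodes the longest alphabet-only prefix of in[0..in_len) and returns the
// number of bytes written. out must hold at least (in_len / 4) * 3 + 2 bytes.
inline std::size_t sketch_b64_decode(const char *in, std::size_t in_len, std::uint8_t *out)
{
  assert(in_len == 0 || (in != nullptr && out != nullptr));

  // Truncate at the first non-alphabet byte.
  std::size_t prefix = 0;
  while (prefix < in_len && sketch_b64_value(static_cast<unsigned char>(in[prefix])) >= 0) {
    ++prefix;
  }

  auto val = [&](std::size_t k) {
    return static_cast<std::uint32_t>(sketch_b64_value(static_cast<unsigned char>(in[k])));
  };

  std::size_t i = 0, o = 0;
  // Whole groups only: i + 4 <= prefix means no read goes past the prefix.
  for (; i + 4 <= prefix; i += 4) {
    std::uint32_t v = (val(i) << 18) | (val(i + 1) << 12) | (val(i + 2) << 6) | val(i + 3);
    out[o++] = static_cast<std::uint8_t>(v >> 16);
    out[o++] = static_cast<std::uint8_t>((v >> 8) & 0xFF);
    out[o++] = static_cast<std::uint8_t>(v & 0xFF);
  }

  // Explicit tail: 2 chars give 1 byte, 3 chars give 2 bytes.
  std::size_t rem = prefix - i;
  if (rem >= 2) {
    std::uint32_t v = (val(i) << 18) | (val(i + 1) << 12);
    out[o++] = static_cast<std::uint8_t>(v >> 16);
    if (rem == 3) {
      v |= val(i + 2) << 6;
      out[o++] = static_cast<std::uint8_t>((v >> 8) & 0xFF);
    }
  }
  // rem == 1 carries only 6 bits, so no byte is emitted (assumption).
  return o;
}
```

Because the group loop is bounded by `i + 4 <= prefix` and the tail handles the remainder by count, no index outside `[0, prefix)` is ever read, which is the property the fix above is about. For example, `"aGVsbG8="` has a decodable prefix of 7 characters and decodes to the 5 bytes of `hello`.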
   
   ## Build / wiring
   
   - `ENABLE_HIGHWAY_DISPATCH` (default OFF) gates the SIMD paths via 
`TS_HAS_HIGHWAY_DISPATCH`; `EXTERNAL_HWY` selects an external Highway over the 
vendored copy.
   - New `branch-highway` CMake preset builds with the option on, turning the 
unit tests into real SIMD-vs-scalar parity checks.
   - `NOTICE` updated to attribute simdutf and Google Highway.
   
   ## Performance
   
   Measured on an Intel Xeon Gold 6338 (Ice Lake-SP, AVX-512), Release build 
(`-O3`), Highway dispatching to its AVX-512 target. Baselines are the scalar 
paths these replace. The base64 public APIs stay on the scalar path below the 
SIMD thresholds (encode 24 B, decode 32 chars) to avoid dispatch overhead on 
tiny inputs. Gains are smallest at the low end of each table, and 8-byte 
tolower is slower than the scalar loop (0.7×).
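
The combination of a compile-time gate and a size threshold can be sketched as follows. Only the `TS_HAS_HIGHWAY_DISPATCH` macro and the two threshold values come from this description; the `Sketch*` and `sketch_*` names are hypothetical, and treating the threshold as inclusive for the SIMD path is an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Mirrors the default-OFF build option when the macro is not provided.
#ifndef TS_HAS_HIGHWAY_DISPATCH
#define TS_HAS_HIGHWAY_DISPATCH 0
#endif

enum class SketchPath { Scalar, Simd };

// Threshold values quoted in the PR description.
constexpr std::size_t kSketchEncodeSimdMinBytes = 24;
constexpr std::size_t kSketchDecodeSimdMinChars = 32;

// Picks the path for an input of n units. Inputs below the threshold stay
// scalar so tiny calls do not pay for dispatch; with the option off, every
// call stays scalar.
inline SketchPath sketch_choose_path(std::size_t n, std::size_t threshold)
{
  assert(threshold > 0);
#if TS_HAS_HIGHWAY_DISPATCH
  return n >= threshold ? SketchPath::Simd : SketchPath::Scalar;
#else
  (void)n;
  return SketchPath::Scalar;
#endif
}
```

In a default build this always selects the scalar path, which matches the "no behavior change" claim for existing builds.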
   
   **ASCII tolower** — ns per call, vs the byte-at-a-time `ink_tolower` loop:
   
   | bytes | scalar (ns) | Highway (ns) | speedup |
   |------:|------------:|-------------:|--------:|
   | 8     | 5.9         | 7.9          | 0.7×    |
   | 16    | 12.6        | 5.0          | 2.5×    |
   | 32    | 21.8        | 4.5          | 4.9×    |
   | 64    | 41.2        | 5.6          | 7.3×    |
   | 256   | 175         | 12.0         | 14.6×   |
   | 1024  | 676         | 32.5         | 20.8×   |
   
   **base64 decode** — GB/s on input chars:
   
   | chars | scalar | Highway | speedup |
   |------:|-------:|--------:|--------:|
   | 64    | 1.1    | 5.2     | 4.9×    |
   | 128   | 1.1    | 6.8     | 6.4×    |
   | 512   | 1.1    | 6.9     | 6.4×    |
   | 64 KB | 1.2    | 8.0     | 6.9×    |
   
   **base64 encode** — GB/s on input bytes:
   
   | bytes | scalar | Highway | speedup |
   |------:|-------:|--------:|--------:|
   | 96    | 1.2    | 3.6     | 3.1×    |
   | 200   | 1.4    | 5.7     | 4.2×    |
   | 512   | 1.4    | 6.9     | 5.1×    |
   | 64 KB | 1.3    | 7.5     | 6.0×    |
   
   ## Testing
   
   - Unit tests for both features (`test_ink_ascii_tolower.cc`, 
`test_ink_base64.cc`) compare the public path against an independent scalar 
reference across sizes, alphabets, truncation, in-place, and buffer-bound 
cases; with `ENABLE_HIGHWAY_DISPATCH=ON` they become SIMD-vs-scalar parity 
tests.
   - `tests/fuzzing/fuzz_base64.cc`: libFuzzer target that decodes untrusted 
input and cross-checks both paths under sanitizers.
   - `tools/benchmark/benchmark_ascii_tolower.cc` reproduces the tolower 
numbers above.
   - Builds and unit tests pass with the option both ON and OFF.
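
The parity approach described above (public path versus an independent reference, across sizes, in-place, and buffer bounds) can be sketched like this. It is not taken from `test_ink_ascii_tolower.cc`: all names are hypothetical, and `sketch_fast_lower` is only a stand-in for whatever function is under test.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

using SketchLowerFn = void (*)(char *dst, const char *src, std::size_t n);

// Independent reference: deliberately written differently from the code
// under test so the two are unlikely to share a bug.
inline char sketch_ref_lower(char c)
{
  return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// Stand-in for the function under test.
inline void sketch_fast_lower(char *dst, const char *src, std::size_t n)
{
  for (std::size_t i = 0; i < n; ++i) {
    unsigned char c = static_cast<unsigned char>(src[i]);
    if (static_cast<unsigned char>(c - 'A') < 26) {
      c |= 0x20;
    }
    dst[i] = static_cast<char>(c);
  }
}

// Returns true if under_test matches the reference for every length in
// [0, max_len], both out-of-place (with guard bytes) and in-place.
inline bool sketch_parity_holds(SketchLowerFn under_test, std::size_t max_len)
{
  assert(under_test != nullptr);
  const char guard = '\x7E';
  for (std::size_t len = 0; len <= max_len; ++len) {
    // Deterministic pattern that walks through all 256 byte values.
    std::string src(len, '\0');
    for (std::size_t i = 0; i < len; ++i) {
      src[i] = static_cast<char>((i * 37 + len) & 0xFF);
    }
    std::string expect(src);
    for (char &c : expect) {
      c = sketch_ref_lower(c);
    }

    // Out-of-place, with one guard byte on each side of the destination.
    std::string dst(len + 2, guard);
    under_test(&dst[1], src.data(), len);
    if (dst.front() != guard || dst.back() != guard) return false;
    if (dst.compare(1, len, expect) != 0) return false;

    // In-place: dst == src.
    std::string inplace(src);
    under_test(&inplace[0], inplace.data(), len);
    if (inplace != expect) return false;
  }
  return true;
}
```

Sweeping every length up to a few multiples of the widest vector is what exercises both the vector body and the scalar tail. A check such as `sketch_parity_holds(sketch_fast_lower, 300)` passes, while a kernel that blindly ORs `0x20` into every byte fails it on inputs like `@`.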
   
   ## Notes
   
   - Depends on the vendored Google Highway copy (#13228).
   - CI currently exercises only the scalar paths; add a job that configures 
the `branch-highway` preset to get parity coverage of the SIMD kernels.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
