It made little difference: LDC compiled it into AVX2 vectorized addition
(vpmovzxbq and vpaddq).

Measurements without -mcpu=native:
overhead              0.336s
bytes                 0.610s
without branch hints  0.852s
code pasted           0.766s

So we should be able to reduce overhead by means of proper code arrangement and interplay of inlining and outlining. The prize, however, would be to get the AVX instructions for ASCII going. Is that possible? -- Andrei
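
To make the question concrete, here is a rough sketch of one way an ASCII fast path could look. It is written in C rather than D, and it is only an illustration, not the benchmark code from this thread. The names (BLOCK, block_is_ascii, decode_one, sum_code_points) are hypothetical, the 32-byte block size is an assumption chosen to match one AVX2 register, and the decoder assumes valid UTF-8 and does no validation. The idea: OR a block of bytes together, test the high bit once, and if the block is pure ASCII, sum it with a plain byte loop that the compiler is free to vectorize. Otherwise, fall back to decoding one code point at a time.

```c
#include <stddef.h>
#include <stdint.h>

enum { BLOCK = 32 };  /* assumed block size: one AVX2 register's worth of bytes */

/* Returns 1 if all BLOCK bytes at p are ASCII (high bit clear).
   The loop is branch-free so the compiler may vectorize it. */
static int block_is_ascii(const unsigned char *p)
{
    unsigned char acc = 0;
    for (size_t i = 0; i < BLOCK; ++i)
        acc |= p[i];
    return (acc & 0x80) == 0;
}

/* Decodes one code point and advances *pp.
   Assumes valid, untruncated UTF-8; performs no validation. */
static uint32_t decode_one(const unsigned char **pp)
{
    const unsigned char *p = *pp;
    uint32_t c = *p++;
    if (c >= 0xF0) {                         /* 4-byte sequence */
        c = (c & 0x07) << 18;
        c |= (uint32_t)(*p++ & 0x3F) << 12;
        c |= (uint32_t)(*p++ & 0x3F) << 6;
        c |= (uint32_t)(*p++ & 0x3F);
    } else if (c >= 0xE0) {                  /* 3-byte sequence */
        c = (c & 0x0F) << 12;
        c |= (uint32_t)(*p++ & 0x3F) << 6;
        c |= (uint32_t)(*p++ & 0x3F);
    } else if (c >= 0xC0) {                  /* 2-byte sequence */
        c = (c & 0x1F) << 6;
        c |= (uint32_t)(*p++ & 0x3F);
    }
    *pp = p;
    return c;
}

/* Sums the code points of a UTF-8 buffer of n bytes. */
uint64_t sum_code_points(const unsigned char *s, size_t n)
{
    const unsigned char *p = s, *end = s + n;
    uint64_t sum = 0;
    while (p < end) {
        if ((size_t)(end - p) >= BLOCK && block_is_ascii(p)) {
            /* Fast path: every byte is its own code point. */
            for (size_t i = 0; i < BLOCK; ++i)
                sum += p[i];
            p += BLOCK;
        } else {
            /* Slow path: decode a single code point, then retry. */
            sum += decode_one(&p);
        }
    }
    return sum;
}
```

Whether the fast-path loops actually turn into AVX2 instructions depends on the compiler and flags. Note also that this sketch re-tests a whole block after every slow-path step, so it would do poorly on text that is mostly non-ASCII; a serious version would need a smarter fallback.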

AVX for ASCII?
What are you referring to?
Most text processing is terribly incompatible with SIMD.
SSE 4.2 has a few instructions that do help, but as far as I am aware it is not yet widespread.

