Niels Möller <[email protected]> writes:
> In initial benchmarking, this loop appears to run in 4.2 cycles per
> iteration on my laptop, and a slowdown by a factor of 3 compared to the
> current C implementation of ghash_update. Penalty may be a bit less for
> assembly implementation, but I haven't tried.
I've got the code to work, and I've written an x86_64 assembly
implementation using SSE2 instructions. The code is on the
ghash-sidechannel-silent branch. On my laptop, I seem to get these
numbers:
  Old C implementation:     350 MB/s
  Old asm implementation:   388 MB/s
  New C implementation:     116 MB/s
  New asm implementation:   196 MB/s
  pclmul implementation:   4047 MB/s
In the new asm code, the inner loop is
.Loop_bit:
	movaps	ONE, M0
	pand	X, M0
	pcmpeqd	ONE, M0
	pshufd	$0xaa, M0, M1
	pshufd	$0, M0, M0
	psrlq	$1, X
	pand	(KEY, CNT), M0
	pand	1024(KEY, CNT), M1
	pxor	M0, R
	pxor	M1, R
	add	$16, CNT
	jnz	.Loop_bit
It appears to run in 283 cycles/block, or 4.4 cycles per iteration of
the above loop (64 iterations per block). I think that indicates that
the bottleneck is instruction issue, 3 instructions per cycle on this
processor (AMD Ryzen 5): the loop is 12 instructions, so it needs at
least 4 cycles per iteration. It's a bit annoying that it takes as many
as 5 instructions to extend the two bits from X (bit indices 0 and 64)
to the two 128-bit mask words M0, M1. Maybe there's some more clever
way?
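
For readers who find intrinsics easier to follow than the assembly,
here is a rough C sketch of the same mask-and-xor loop. It is only an
illustration, not the code on the branch: the function names are
hypothetical, and "table" merely stands in for the key-dependent table
addressed via KEY (its real contents and bit order are not shown in
this message). The branching reference function is there only to check
the sketch.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not Nettle's API. table holds 128 entries of 16
   bytes; entry i is selected by bit i of the low 64-bit half of x,
   entry 64 + i by bit i of the high half. */
static __m128i
masked_table_xor (__m128i x, const uint8_t *table)
{
  const __m128i one = _mm_set_epi32 (0, 1, 0, 1); /* bit 0 of each half */
  __m128i r = _mm_setzero_si128 ();
  int i;

  for (i = 0; i < 64; i++)
    {
      /* pand + pcmpeqd: dword 0 is all ones iff bit 0 of x is set,
         dword 2 is all ones iff bit 64 of x is set. */
      __m128i m = _mm_cmpeq_epi32 (_mm_and_si128 (x, one), one);
      __m128i m1 = _mm_shuffle_epi32 (m, 0xaa); /* broadcast dword 2 */
      __m128i m0 = _mm_shuffle_epi32 (m, 0x00); /* broadcast dword 0 */
      __m128i t0 = _mm_loadu_si128 ((const __m128i *) (table + 16 * i));
      __m128i t1 = _mm_loadu_si128 ((const __m128i *) (table + 16 * (64 + i)));

      x = _mm_srli_epi64 (x, 1); /* psrlq $1: next bit of each half */

      /* Every table entry is read; the masks decide what is kept. */
      r = _mm_xor_si128 (r, _mm_and_si128 (m0, t0));
      r = _mm_xor_si128 (r, _mm_and_si128 (m1, t1));
    }
  return r;
}

/* Plain reference with data-dependent branches, for checking only. */
static void
reference_table_xor (const uint64_t x[2], const uint8_t *table,
                     uint8_t out[16])
{
  int i, j;
  memset (out, 0, 16);
  for (i = 0; i < 128; i++)
    if ((x[i / 64] >> (i % 64)) & 1)
      for (j = 0; j < 16; j++)
        out[j] ^= table[16 * i + j];
}

/* Returns 1 if the masked loop agrees with the reference for the input
   (lo, hi), using arbitrary filler bytes as table contents. */
static int
sketch_matches_reference (uint64_t lo, uint64_t hi)
{
  uint8_t table[128 * 16], expect[16], got[16];
  uint64_t x[2];
  uint32_t s = 1;
  unsigned i;

  for (i = 0; i < sizeof table; i++)
    {
      s = s * 1103515245u + 12345u;
      table[i] = (uint8_t) (s >> 16);
    }
  x[0] = lo;
  x[1] = hi;
  reference_table_xor (x, table, expect);
  _mm_storeu_si128 ((__m128i *) got,
                    masked_table_xor (_mm_loadu_si128 ((const __m128i *) x),
                                      table));
  return memcmp (expect, got, 16) == 0;
}
```

The point of the construction is that the memory access pattern and the
instruction sequence are the same for every input; only the mask values
depend on X.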
I can see some possible improvements: one could use the sign bit
instead, replacing the first three instructions by two: movaps X, M0;
psrlq $63, M0. Or one could do 4 bits (e.g., sign bits 127, 95, 63, 31)
instead of just 2, with only two more pshufd to create the additional
masks. Together, I think that would be a loop of 17 instructions for
doing 4 bits.
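
A possible shape for that 4-bit variant, again as a hedged C sketch
with hypothetical names rather than anything taken from the branch. One
assumption to flag: the sketch uses an arithmetic shift (psrad $31,
_mm_srai_epi32) to turn each dword's sign bit into a full 32-bit mask,
and the table indexing (entry 32*d + i selected by bit 31 - i of dword
d) is made up for the example.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: load 16-byte table entry number index. */
static __m128i
load_entry (const uint8_t *table, int index)
{
  return _mm_loadu_si128 ((const __m128i *) (table + 16 * index));
}

/* Hypothetical 4-bits-per-iteration loop. Entry 32*d + i of table is
   selected by bit 31 - i of dword d of x (an assumed layout). */
static __m128i
masked_table_xor_4bit (__m128i x, const uint8_t *table)
{
  __m128i r = _mm_setzero_si128 ();
  int i;

  for (i = 0; i < 32; i++)
    {
      /* Copy + arithmetic shift: each dword becomes all ones iff its
         sign bit (bits 31, 63, 95, 127 of x) is set. */
      __m128i m = _mm_srai_epi32 (x, 31);
      __m128i m0 = _mm_shuffle_epi32 (m, 0x00); /* broadcast dword 0 */
      __m128i m1 = _mm_shuffle_epi32 (m, 0x55); /* broadcast dword 1 */
      __m128i m2 = _mm_shuffle_epi32 (m, 0xaa); /* broadcast dword 2 */
      __m128i m3 = _mm_shuffle_epi32 (m, 0xff); /* broadcast dword 3 */

      x = _mm_slli_epi32 (x, 1); /* move the next bit into each sign bit */

      r = _mm_xor_si128 (r, _mm_and_si128 (m0, load_entry (table, i)));
      r = _mm_xor_si128 (r, _mm_and_si128 (m1, load_entry (table, 32 + i)));
      r = _mm_xor_si128 (r, _mm_and_si128 (m2, load_entry (table, 64 + i)));
      r = _mm_xor_si128 (r, _mm_and_si128 (m3, load_entry (table, 96 + i)));
    }
  return r;
}

/* Plain reference with data-dependent branches, for checking only. */
static void
reference_table_xor_4bit (const uint32_t w[4], const uint8_t *table,
                          uint8_t out[16])
{
  int d, i, j;
  memset (out, 0, 16);
  for (d = 0; d < 4; d++)
    for (i = 0; i < 32; i++)
      if ((w[d] >> (31 - i)) & 1)
        for (j = 0; j < 16; j++)
          out[j] ^= table[16 * (32 * d + i) + j];
}

/* Returns 1 if the masked loop agrees with the reference for the four
   dwords given, using arbitrary filler bytes as table contents. */
static int
sign_bit_sketch_matches_reference (uint32_t w0, uint32_t w1,
                                   uint32_t w2, uint32_t w3)
{
  uint8_t table[128 * 16], expect[16], got[16];
  uint32_t w[4];
  uint32_t s = 7;
  unsigned i;

  for (i = 0; i < sizeof table; i++)
    {
      s = s * 1103515245u + 12345u;
      table[i] = (uint8_t) (s >> 16);
    }
  w[0] = w0; w[1] = w1; w[2] = w2; w[3] = w3;
  reference_table_xor_4bit (w, table, expect);
  _mm_storeu_si128 ((__m128i *) got,
                    masked_table_xor_4bit
                      (_mm_loadu_si128 ((const __m128i *) w), table));
  return memcmp (expect, got, 16) == 0;
}
```

If the table loads fold into the pand instructions as memory operands,
as in the current loop, this comes to 2 (copy and shift) + 4 pshufd +
1 shift of X + 4 pand + 4 pxor + add and jnz = 17 instructions per
iteration, which is consistent with the estimate above.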
Regards,
/Niels
--
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.
_______________________________________________
nettle-bugs mailing list -- [email protected]