On Tue 09 Oct 2018 05:31:32 PM CEST, Daniel P. Berrangé wrote: > On Tue, Oct 09, 2018 at 05:30:25PM +0200, Alberto Garcia wrote: >> >> > for (i = 0; i < lim; i++) { >> >> > - xts_tweak_encdec(datactx, decfunc, src, dst, (uint8_t *)&T); >> >> > + xts_uint128 S, D; >> >> > + >> >> > + memcpy(&S, src, XTS_BLOCK_SIZE); >> >> > + xts_tweak_encdec(datactx, decfunc, &S, &D, &T); >> >> > + memcpy(dst, &D, XTS_BLOCK_SIZE); >> >> >> >> Why do you need S and D? >> > >> > I think src & dst pointers can't be guaranteed to be aligned >> > sufficiently for int64 operations, if we just cast from uint8t*. >> >> I see. I did a quick test without the memcpy() calls and it doesn't >> seem to have a visible effect on performance, but if it turns out >> that it does then maybe this is worth investigating further. I >> suspect all buffers received by this code are allocated with >> qemu_try_blockalign() anyway, so it should be safe. > > The extra memcpy() calls certainly had a perf impact when I added > them, so if we can determine that we can safely do without, that would > be desirable.
So I was having a look at this. From the block layer everything comes properly aligned. Then there's VirtioCrypto, which seems to allow XTS mode but I couldn't quite tell from virtio_crypto_sym_op_helper() if all buffers are guaranteed to be aligned. What you could do is a runtime check (with QEMU_PTR_IS_ALIGNED()) and decide what implementation to use. A couple of additional thoughts: - x86_64 (and others) allow unaligned memory accesses, and that might be faster than copying the buffer using memcpy(). I haven't measured it however. - qcrypto_block_{encrypt,decrypt}_helper() (used for encrypted block I/O) use the same buffer for input and output, so maybe it's worth exploring if this fact allows for additional optimization if you still need to use memcpy(). Berto