Re: [AArch64] Optimize GHASH

Maamoun TK, Sun, 24 Jan 2021 08:45:02 -0800

Hello Michael,

On Sun, Jan 24, 2021 at 3:15 PM Michael Weiser <[email protected]>
wrote:


> I think there might be a misunderstanding here (possibly caused by
> my attempts at explaining what ldr does, sorry):
>
> On arm(32) and aarch64, endianness is also exclusively handled on
> load and store operations. Register layout and operation behaviour is
> identical in both modes. I think ARM also speaks of "memory endianness"
> for just that reason. There is no adjustable "CPU endianness". It's
> always "CPU-native".
>
> So pmull will behave exactly the same in BE and LE mode. We just have
> to make sure our load operations put the operands in the correct (i.e.
> CPU-native) representation into the correct vector register indices upon
> load.
>
> So as an example:
>
> pmull2 v0.1q,v1.2d,v2.2d
>
> will always work on d[1] of v1 and v2 and put the result into all of v0.
> And it expects its operands there in exactly one format, i.e. the least
> significant bit at one end and the most-significant bit at the other
> (and it's the same ends/bits in both memory-endianness modes :). And it
> will also store to v0 in exactly the same representation in LE and BE
> mode. Nothing changes with an endianness mode switch.
>
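
To make the pmull2 example above concrete, here is a minimal Python sketch of
what the instruction computes. It assumes a vector register is modelled as a
128-bit integer with d[0] in the low 64 bits and d[1] in the high 64 bits. The
names clmul64 and pmull2 are made up for illustration; this is a model for
experimenting, not how the hardware or Nettle implements it.

```python
MASK64 = (1 << 64) - 1

def clmul64(a, b):
    """Carry-less (polynomial over GF(2)) product of two 64-bit values."""
    a &= MASK64
    b &= MASK64
    result = 0
    while b:
        if b & 1:
            result ^= a      # "add" without carries is XOR
        a <<= 1
        b >>= 1
    return result

def pmull2(vn, vm):
    """Model of pmull2 vd.1q, vn.2d, vm.2d: multiply the upper doublewords."""
    return clmul64(vn >> 64, vm >> 64)   # only d[1] of each operand is used
```

Nothing in this model depends on a memory-endianness flag, which is the point
made above: the operation only sees register contents.
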
> That's where load and store come in:
>
> ld1 {v1.2d,v2.2d},[x0]
>
> will load v1 and v2 with one-dimensional vectors from memory. So v1.d[0]
> will be read from x0+0, v1.d[1] from x0+8 (bytes) and v2.d[0] from x0+16
> and v2.d[1] from x0+24. That'll also be the same in LE and BE mode
> because that's the structure of the vector prescribed by the load
> operation we choose. Endianness will be applied to the individual
> doublewords but the order in which they're loaded from memory and in
> which they're put into d[0] and d[1] won't change, because they're
> vectors.
>
> So if you've actually stored a vector from CPU registers using
> st1 {v1.2d, v2.2d},[x0]
> and then load them back using
> ld1 {v1.2d, v2.2d},[x0]
> there's nothing else that needs to be done. The individual bytes of the
> doublewords will be stored LE in memory in LE mode and BE in BE mode but
> you won't notice. And the order of the doublewords in memory will be the
> same in both modes.
>
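
A small Python sketch of the st1/ld1 {.2d} round trip described above, under
the assumption that memory is a bytes object and the big_endian flag stands
for the current memory-endianness mode. The function names are hypothetical
and each function handles a single register (two doublewords) only.

```python
def st1_2d(d, big_endian):
    """Model of st1 {v.2d},[x0]: d[0] goes to offset 0, d[1] to offset 8."""
    order = "big" if big_endian else "little"
    return d[0].to_bytes(8, order) + d[1].to_bytes(8, order)

def ld1_2d(mem, big_endian):
    """Model of ld1 {v.2d},[x0]: offset 0 feeds d[0], offset 8 feeds d[1]."""
    order = "big" if big_endian else "little"
    return [int.from_bytes(mem[0:8], order),
            int.from_bytes(mem[8:16], order)]
```

The bytes inside each doubleword differ in memory between the two modes, but
the order of the doublewords does not, and a store followed by a load in the
same mode gives back the original register content.
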
> If you're loading something that isn't stored LE or has no endianness at
> all, e.g. just a sequence of data bytes (as in DATA in our code) or
> something that was explicitly stored BE even on an LE CPU (as in
> TABLE[128] in our code, I gather) but want to treat it as a larger
> datatype, then you have to define endianness and need to apply
> correction. That's why we need to rev64 in one mode (e.g. LE) to get the
> same register-content in both endianness modes if what's loaded isn't
> actually stored in that endianness in memory.
>
> Another way is to explicitly load a vector of bytes using ld1 {v1.16b,
> v2.16b},[x0]. Then you can be sure what you get as register content, no
> matter what memory endianness the CPU is using. If it's really just a
> sequence of data bytes stored in their correct and necessary order in
> memory and we only want to apply shifts and logical operations to each
> of them, we'd be all set.
>
> Here we can also exploit the different views on the register, but need
> to be careful to understand them: b[0] through b[7] are mapped to d[0],
> with b[0] as the least significant byte of d[0] and b[7] as the most
> significant. This layout is CPU-native, i.e. also the same in both
> endianness modes. It's just that an ld1 {v1.16b} will always load
> consecutive bytes from memory into consecutive byte elements, so the
> first eight bytes land in b[0] through b[7] and it'll always be an
> LSB-first load when interpreted as a larger data type. If we then look
> at that data through d[0] it will appear reversed if it isn't really a
> doubleword that was stored little-endian.
>
> That's why an ld1 {v1.16b,v2.16b},[x0] will produce incorrect results
> with a pmull2 v0.1q,v1.2d,v2.2d in at least one endianness because we're
> telling one operation that it's dealing with a byte-vector and the other
> expects us to provide a vector of doublewords. If what we're loading is
> actually something that was stored as doublewords in current memory
> endianness, then ld1 {v1.2d,v2.2d} is the correct load operation. If
> it's data bytes we want to *treat* as a big-endian doubleword, we can
> use either ld1 {v1.16b,v2.16b} or {v1.2d,v2.2d} but in both cases need
> to rev64 the register content if memory endianness is LE.
>
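
The rev64 rule above can be checked with a short Python sketch. It assumes the
same model as before (a register is a 128-bit integer, d[0] in the low half)
and hypothetical helper names. It loads plain data bytes that we want to
*treat* as big-endian doublewords.

```python
MASK64 = (1 << 64) - 1

def ld1_16b(mem):
    """Model of ld1 {v.16b}: b[i] = mem[i] in both modes, b[0] is the LSB."""
    return int.from_bytes(mem[:16], "little")

def ld1_2d(mem, big_endian):
    """Model of ld1 {v.2d}: each doubleword is read in memory endianness."""
    order = "big" if big_endian else "little"
    d0 = int.from_bytes(mem[0:8], order)
    d1 = int.from_bytes(mem[8:16], order)
    return d0 | (d1 << 64)

def rev64(reg):
    """Model of rev64 v.16b: reverse the bytes inside each doubleword."""
    def swap(d):
        return int.from_bytes(d.to_bytes(8, "little"), "big")
    return swap(reg & MASK64) | (swap(reg >> 64) << 64)

def load_be_doublewords(mem, big_endian):
    """ld1 {.2d}, with rev64 applied only when memory endianness is LE."""
    reg = ld1_2d(mem, big_endian)
    return reg if big_endian else rev64(reg)
```

In this model the .2d load needs the rev64 only in LE mode, while the .16b load
gives the same register content in both modes and therefore needs the rev64
regardless of mode to yield big-endian doublewords.
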
> Now what *ldr* does is load a single 128bit quadword. And this will
> indeed transpose the doublewords in BE mode when looked at through d[0]
> and d[1]. Because as a big-endian load it will of course load the byte
> at x0 into the most significant byte of e.g. v2, i.e. v2.d[1], i.e.
> v2.b[15] and not v2.d[0], i.e. v2.b[7] (as with ld1.2d) or v2.b[0] (as
> with ld1.16b). Similarly, x0+15 will go into v2.b[0] in BE and v2.b[15]
> in LE mode. So this will only make sense if what we're loading was
> actually stored using str as a 128bit quadword in current memory
> endianness. If it's a sequence of bytes (st1.16b) we want to treat as a
> vector of doublewords, not only will the bytes appear inverted when
> looked at through d[0] and d[1] but also what's in d[0] will be in d[1]
> in the other endianness mode and vice-versa. If it's a vector of
> doublewords in memory endianness (st1.2d), byte order in the register
> will be correct in both modes (because it's different in memory) but
> d[0] and d[1] will still be transposed.
>
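
The doubleword transposition caused by ldr can be reproduced with the same
kind of Python model. Assumptions: a register is a 128-bit integer with d[0]
in the low half, and ldr_q, st1_2d and lanes are illustrative names, not real
APIs.

```python
MASK64 = (1 << 64) - 1

def ldr_q(mem, big_endian):
    """Model of ldr q, [x0]: one 128-bit value read in memory endianness."""
    return int.from_bytes(mem[:16], "big" if big_endian else "little")

def st1_2d(d0, d1, big_endian):
    """Model of st1 {v.2d}: d[0] at offset 0, d[1] at offset 8."""
    order = "big" if big_endian else "little"
    return d0.to_bytes(8, order) + d1.to_bytes(8, order)

def lanes(reg):
    """Return (d[0], d[1]) of a 128-bit register value."""
    return (reg & MASK64, reg >> 64)
```

For memory written by st1 {.2d}, ldr gives back (d[0], d[1]) unchanged in LE
mode, but in BE mode the byte order inside each doubleword is right while the
two doublewords come back swapped.
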
> That's where all my rambling about doubleword transposition came from.
> Does that make sense?
>
> I just found this document from the LLVM guys with pictures! :)
> https://llvm.org/docs/BigEndianNEON.html
>
> BTW: ARM even goes as far as always storing *instructions* themselves,
> so the actual opcodes the CPU decodes and executes, little-endian, even
> in BE binaries. So the instruction fetch and decode stage always
> operates little-endian. When the instruction is executed it's then just
> an additional flag that tells load and store instructions how to behave
> when executed and accessing memory. (I'm actually extrapolating from
> what I know to be true for classic arm32 but it makes sense for that to
> be true for aarch64 as well.)
>

That explains everything. It also explains why the ld1 instruction reverses
the byte order according to the element type on BE and always keeps the same
order on LE. The non-memory instructions behave the same, as they should,
no matter which endianness mode they run in. Thanks for the detailed
explanation.

This scheme has a couple of advantages:
- It keeps the performance benefit of the LE data layout in both memory
  and registers.
- It avoids the overhead of transposing the data order on every load/store
  operation in LE mode, which is the more popular mode.

> I gather you (same as me) prefer to think in big-endian
> representation. As for arm and aarch64, little-endian is the default, do
> you think, the routine could be changed to move the special endianness
> treatment using rev64 to BE mode, i.e. avoid them in the standard LE
> case? It's certainly beyond me but it might give some additional
> speedup.
>
> Or would it be irrelevant compared to the speedup already given by using
> pmull in the first place?


I don't know how it is going to affect performance, but the margin is indeed
negligible. TBH I liked the patch with the special endianness treatment, but
it's up to you to decide!


> > > And as always after all this guesswork I have found a likely very
> > > relevant comment in gcm.c:
> > >
> > >   /* Shift uses big-endian representation. */
> > > #if WORDS_BIGENDIAN
> > >   reduce = shift_table[x->u64[1] & 0xff];
> > >
> > > Is that it? Or is TABLE just internal to the routine and we can store
> > > there however we please? (Apart from H at TABLE[128] initialised for us
> > > by gcm_set_key and stored BE?)
> > >
> > The assembly implementation of GHASH has a whole different scheme from C
> > table-lookup implementation, you don't have to worry about any of that.
>
> Perfect. So whether we use ld1/st1.16b or ld1/st1.2d for TABLE doesn't
> matter. I wouldn't expect it but we could benchmark whether one is faster
> than the other though!?
>

Yeah, it doesn't matter, since gcm_init_key() and gcm_hash() are the only
functions that use the table. Keeping it as ld1/st1.16b is fine; either way,
there is a table layout in the header of the file that gives a sense of the
table structure for the assembly implementation scheme.


> For clarification: How is H, i.e. TABLE[128], defined as an interface to
> gcm_set_key? I see that gcm_set_key calls a cipher function to fill it.
> So I guess it provides the routine with a sequence of bytes  (similar to
> DATA), i.e. the key, which will be the same on LE and BE and we *treat*
> it as a big-endian doubleword for the sake of using pmull on it.
> Correct?
>

The subkey 'H' is calculated by enciphering (usually with AES) a block of
all-zero data. gcm_set_key() then assigns the calculated value (subkey 'H')
to the middle of the TABLE array, that is TABLE[0x80] (i.e. TABLE[128]). The
remaining fields of the array are meant to be filled by the C gcm_init_key()
routine to serve as auxiliary subkeys for the C table-lookup implementation.
Since the assembly implementation uses a different scheme, we don't need
those auxiliary subkeys, so we grab the main subkey (H) from the middle of
the table and store our own auxiliary values in this table for use by
gcm_hash(). Hope that makes sense to you; let me know if you want further
explanation.
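
As a rough Python sketch of the subkey placement only: the block below assumes
a table of 2^8 entries of 16 bytes each (so the middle index is 128 = 0x80)
and uses a hypothetical stand-in toy_block_cipher where the real code would
call the block cipher (usually AES). The stand-in is NOT a cipher and none of
these names are Nettle APIs.

```python
import hashlib

GCM_BLOCK_SIZE = 16
GCM_TABLE_BITS = 8                       # assumption: 256-entry table
H_INDEX = (1 << GCM_TABLE_BITS) // 2     # middle of the table: 128 == 0x80

def toy_block_cipher(key, block):
    """Hypothetical stand-in for AES: deterministic 16 bytes, NOT a cipher."""
    return hashlib.sha256(key + block).digest()[:GCM_BLOCK_SIZE]

def set_key_sketch(key, encrypt=toy_block_cipher):
    """Fill only the H slot: H = E_K(all-zero block), stored mid-table."""
    table = [bytes(GCM_BLOCK_SIZE) for _ in range(1 << GCM_TABLE_BITS)]
    table[H_INDEX] = encrypt(key, bytes(GCM_BLOCK_SIZE))
    return table
```

The other slots stay zero here; in the scheme described above they are where
the assembly routine keeps its own auxiliary values.
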

regards,
Mamone
_______________________________________________
nettle-bugs mailing list
[email protected]
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
