Michael Weiser <michael.wei...@gmx.de> writes:

> See the attached patch for my current approach to fixing it, which is
> explicit transposing, adding and then transposing again to be as
> transposed as the other operands. 

I haven't yet read the code, but I have some comments based on your
description only.

> I wonder if the surrounding C code
> could be changed to supply that part of the state as a 64-bit doubleword
> in host endianness to the assembler routine to cut down on adjustment.

I think it will be a bit cumbersome to change the interface to the C
code.

> Alternatively, could the 64-bit operation be broken down into two 32-bit
> operations which implicitly adjust to the transposed 32-bit words on BE?

Maybe. But we still need to propagate the carry. Can that be done in a
better way than transpose, 64-bit add, transpose?
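
To make the carry question concrete, here is a rough model in Python (not
NEON code, and not taken from the patch). The function name and the
restriction that the increment fits in 32 bits are assumptions made for
illustration only. It shows that a 64-bit counter add can be split into two
32-bit adds if the carry out of the low word is recovered with an unsigned
compare.

```python
MASK32 = 0xFFFFFFFF

def add64_as_two_32(lo, hi, inc):
    """Add inc (assumed < 2**32) to the 64-bit counter hi:lo with 32-bit adds only."""
    new_lo = (lo + inc) & MASK32
    # If the low word wrapped around, the result is smaller than the input.
    carry = 1 if new_lo < lo else 0
    new_hi = (hi + carry) & MASK32
    return new_lo, new_hi

# Compare against a plain 64-bit add for a few counter values.
for lo, hi, inc in [(0xFFFFFFFF, 0, 1), (0xFFFFFFFE, 3, 2), (5, 7, 2)]:
    full = (((hi << 32) | lo) + inc) & 0xFFFFFFFFFFFFFFFF
    assert add64_as_two_32(lo, hi, inc) == (full & MASK32, full >> 32)
```

Whether the compare and extra add are actually cheaper than transpose, 64-bit
add, transpose on a given core is not something this sketch can answer.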

> I've tried to document what I see in the registers on armeb to get a
> handle on how to proceed:
>
>       vtrn.32 X0, Y3          C X0:  0  0  2  2  Y3:  1  1  3  3
>       vtrn.32 X1, Y0          C X1:  4  4  6  6  Y0:  5  5  7  7
> -     vtrn.32 X2, Y1          C X2:  8  8 10 10  Y1:  9  9  1  1 <- typo?
> +     vtrn.32 X2, Y1          C X2:  8  8 10 10  Y1:  9  9 11 11

Indeed a typo. I just checked in the fix, thanks!

>       vtrn.32 X3, Y2          C X3: 12 12 14 14  Y2: 13 13 15 15
> +                             C BE:
> +                             C X0:  3  3  1  1  Y3:  2  2  0  0
> +                             C X1:  7  7  5  5  Y0:  6  6  4  4
> +                             C X2: 11 11  9  9  Y1: 10 10  8  8
> +                             C X3: 15 15 13 13  Y2: 14 14 12 12

Also, it's somewhat important to keep track of which block a word
belongs to. In the LE code, X0 really is A0 B0 A2 B2, where A refers to
the first block, and B to the second.

What's the layout before the transpose, immediately after load? I'd
guess you get X0: 1 0 3 2?   

For the little endian code, the transpose can be viewed as

  X0:  A0 A1 A2 A3
         /     /    denotes elements swapped.
  Y3:  B0 B1 B2 B3

If instead we start with the order 1 0 3 2, we get the same result (but
with registers swapped) if we do

  Y3:  B1 B0 B3 B2
         \     \
  X0:  A1 A0 A3 A2

So I would expect there's some clever way to get the BE case to work
with about the same number of transpose instructions, even if initial
word order is somewhat different.
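
The two diagrams can be checked with a small Python model of vtrn.32. This
is only a sketch: the list index stands for the element position used in the
diagrams, and the 1 0 3 2 starting order is the guess from above, not
something verified on armeb.

```python
def vtrn32(d, m):
    """Model of `vtrn.32 d, m` on two 4-element registers.

    Each 2x2 sub-matrix is transposed: d[1] swaps with m[0], d[3] with m[2].
    """
    return [d[0], m[0], d[2], m[2]], [d[1], m[1], d[3], m[3]]

# Little-endian order, as in the first diagram.
x0_le, y3_le = vtrn32(["A0", "A1", "A2", "A3"], ["B0", "B1", "B2", "B3"])

# Pairwise-swapped order 1 0 3 2, as in the second diagram.
x0_sw, y3_sw = vtrn32(["A1", "A0", "A3", "A2"], ["B1", "B0", "B3", "B2"])

# Same contents, but the two registers have traded places.
assert x0_sw == y3_le
assert y3_sw == x0_le
```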

> I wonder if the code working on them contains some symmetry that could
> be exploited to (with minimal changes) get correct results on these
> transposed matrices.

At least, both blocks are treated equally (except that the initial
counter addition is done to only the second block, and that the final result
is written in the right order). So it doesn't matter if X0 contains A0 B0
A2 B2 or B0 A0 B2 A2. And unlike the one-way code, we only use

  vext.32 ... #2

to rotate data between rounds, never #1 or #3.
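
A small Python model may help show why the #2 matters. It is a sketch under
the assumption that the only difference between the two layouts is a swap
within each pair of elements; swap_pairs is a helper made up for this
illustration.

```python
def vext32(n, m, imm):
    """Model of `vext.32 d, n, m, #imm`: four elements of n:m starting at imm."""
    return (n + m)[imm:imm + 4]

def swap_pairs(x):
    """Turn A0 B0 A2 B2 into B0 A0 B2 A2 (hypothetical helper)."""
    return [x[1], x[0], x[3], x[2]]

def rotation_commutes(x, imm):
    """True if rotating by imm gives the same data in both layouts."""
    return vext32(swap_pairs(x), swap_pairs(x), imm) == swap_pairs(vext32(x, x, imm))

x0 = ["A0", "B0", "A2", "B2"]
assert rotation_commutes(x0, 2)       # rotate by two: layouts stay in step
assert not rotation_commutes(x0, 1)   # rotate by one: A and B lanes get mixed
assert not rotation_commutes(x0, 3)
```

So a rotation by two elements keeps the A/B pairing intact in either layout,
while #1 or #3 would not.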

> Otherwise I wonder if it would be possible for both chacha and salsa to
> change the actual loading and storing so there's no transposing of
> 32-bit operands. I looked at vld4.32 but that does some fancy
> de-interleaving and needs two operations to load four q registers.

The new powerpc code uses load and store instructions that behave the
same in this respect, for both BE and LE. But I'm not sure if there's any
easy way on ARM. I'm not that familiar with the more special load and
store instructions. Would vst2.32 be useful in some way for the final
store (and vst3.32 for chacha-3core)?

> Otherwise we'd need a lot of vrev64.u32s to basically revert the 32-bit
> transposition happening upon load and save to end up with identical
> matrices to LE.

If that's an easier way to get it working, I think it's a good start.
I'd expect that would still give a reasonable speedup over the 1-way
version.
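
As a sketch of that fallback, again in Python and again assuming the loaded
word order on BE really is 1 0 3 2 (a guess, not verified), a vrev64.32 on
each register before the transpose gives the same matrices as on LE:

```python
def vrev64_32(q):
    """Model of `vrev64.32`: reverse the 32-bit elements in each 64-bit half."""
    return [q[1], q[0], q[3], q[2]]

def vtrn32(d, m):
    """Model of `vtrn.32 d, m`: d[1] swaps with m[0], d[3] with m[2]."""
    return [d[0], m[0], d[2], m[2]], [d[1], m[1], d[3], m[3]]

# Assumed register contents right after the load on BE.
loaded_a = ["A1", "A0", "A3", "A2"]
loaded_b = ["B1", "B0", "B3", "B2"]

x0, y3 = vtrn32(vrev64_32(loaded_a), vrev64_32(loaded_b))
assert x0 == ["A0", "B0", "A2", "B2"]   # same as the LE layout
assert y3 == ["A1", "B1", "A3", "B3"]
```

The cost is one extra instruction per register on load, and presumably the
same again before the store.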

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.