Hi Niels,

On Mon, Feb 12, 2018 at 08:59:16AM +0100, Niels Möller wrote:

> > Right. When this still didn't fix it, I compared little- and big-endian
> > behaviour and found that a.) vldm and vstm switch doublewords for no
> > reason I can see or find documentation about and b.) 
> By "doublewords", you mean 64-bit words, right?

Yes. ARM talks in bytes, halfwords, words, doublewords and quadwords.

> It might make sense to view it as big-endian or little-endian load of
> 128-bit values, and a 128-bit (16-byte) byte swap will then also swap
> the low and high 64-bit halves.
[...]
> If it's hard to find docs, I take it as a sign big-endian arm is a bit
> obscure...

Actually, it's all quite well documented, just not always as obviously
as I'd like: the ARM ARM (Architecture Reference Manual) spells out the
low-level details. After additionally looking very closely at the gdb
output, this is what I found for the chacha and salsa implementations:

1. There's no vldm or vstm on quadword registers in the architecture. The
assembler translates it into a vldm or vstm on the corresponding
doubleword registers (two per quadword register).

Disassembly of section .text:

00000000 <_nettle_chacha_core>:
   0:   ec910b10        vldmia  r1, {d0-d7}

This is hinted at here
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204j/Bcfchhif.html
by saying: "If Q registers are specified, on disassembly they are shown
as D registers."

2. vldm and vstm on doubleword registers swap 32-bit words inside the
doubleword to get a full byte-swap in addition to the byte- and
halfword-swapping the word-access already does. Since chacha and salsa
input is a matrix of 32-bit words, the word swap transposes even and odd
columns (not doublewords):

// Combine the word-aligned words in the correct order for current endianness. 
D[d+r] = if BigEndian() then word1:word2 else word2:word1; 

3. The input to chacha-core is 32-bit words in host endianness.

4. gdb's print output ordering is really confusing.

So all that's basically happening is that odd and even columns get
switched. The individual words' values are exactly the same because the
input is in host endianness already. So NEON doesn't adjust for
endianness after all.
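To make the points above concrete, here is a small Python model of the
quoted ARM ARM pseudocode. It is only a sketch: the function names are
made up for illustration, and the input is the first row of the chacha
state (the "expand 32-byte k" constants) as host-endian 32-bit words.

```python
# Model of vldm on one quadword register, following the ARM ARM pseudocode
#   D[d+r] = if BigEndian() then word1:word2 else word2:word1;
# Function names are hypothetical, chosen for this illustration only.

def vldm_d(word1, word2, big_endian):
    """Combine two 32-bit words from consecutive addresses into one D value."""
    hi, lo = (word1, word2) if big_endian else (word2, word1)
    return (hi << 32) | lo

def vldm_q(words, big_endian):
    """Load one Q register as two D registers; Q is d1:d0, d0 in the low half."""
    d0 = vldm_d(words[0], words[1], big_endian)
    d1 = vldm_d(words[2], words[3], big_endian)
    return (d1 << 64) | d0

def lanes32(q):
    """Return the four 32-bit lanes of a 128-bit value, lane 0 first."""
    return [(q >> (32 * i)) & 0xFFFFFFFF for i in range(4)]

# Host-endian 32-bit words in memory order.
row = [0x61707865, 0x3320646E, 0x79622D32, 0x6B206574]

le_lanes = lanes32(vldm_q(row, big_endian=False))
be_lanes = lanes32(vldm_q(row, big_endian=True))

print([hex(w) for w in le_lanes])  # columns in order 0:1:2:3
print([hex(w) for w in be_lanes])  # columns in order 1:0:3:2
```

The word values are identical in both cases; only the lane each word
lands in differs.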

What's been fooling me is that apparently gdb tries to show the values
of vector registers as if they had been stored to memory by an operation
of the full bit-size of the register shown and then read back again as
consecutive elements of various other sizes (8, 16, 32, 64-bit):

p/x $q0
le: u8 = {0x65, 0x78, 0x70, 0x61, 0x6e, 0x64, 0x20, 0x33, 0x32, 0x2d, 0x62, 0x79, 0x74, 0x65, 0x20, 0x6b}
be: u8 = {0x79, 0x62, 0x2d, 0x32, 0x6b, 0x20, 0x65, 0x74, 0x61, 0x70, 0x78, 0x65, 0x33, 0x20, 0x64, 0x6e}
          ^ bytes reversed by 128-bit store + read as byte sequence -> vldm
            1:0:3:2 column swap still visible

le: u32 = {0x61707865, 0x3320646e, 0x79622d32, 0x6b206574}
be: u32 = {0x79622d32, 0x6b206574, 0x61707865, 0x3320646e}
           ^ bytes reversed by 128-bit store + read as four consecutive
             big-endian 32-bit words + vldm column swap -> makes it appear
             doublewords have been swapped
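The display behaviour described above can be reproduced with a few lines
of Python. This is a model of what gdb appears to do, not a statement
about gdb's implementation; gdb_view is a hypothetical helper, and the
two register values are written out by hand as d1:d0 for each
endianness.

```python
# Hypothetical model: store the full 128-bit register to memory in target
# byte order, then read it back as consecutive elements of a given size.

def gdb_view(q, elem_bytes, byteorder):
    raw = q.to_bytes(16, byteorder)
    return [int.from_bytes(raw[i:i + elem_bytes], byteorder)
            for i in range(0, 16, elem_bytes)]

# Register contents after vldm, most significant word on the left.
# Little-endian: columns in order, word 0 in the lowest lane.
q_le = 0x6B206574_79622D32_3320646E_61707865
# Big-endian: even and odd columns transposed (lanes hold 1:0:3:2).
q_be = 0x79622D32_6B206574_61707865_3320646E

le_u32 = gdb_view(q_le, 4, "little")
be_u32 = gdb_view(q_be, 4, "big")
be_u8 = gdb_view(q_be, 1, "big")

print([hex(w) for w in le_u32])
print([hex(w) for w in be_u32])
print([hex(b) for b in be_u8])
```

With these inputs the model yields the same u32 and u8 listings as the
gdb output quoted above, which is consistent with the explanation that
only the columns are swapped and the apparent doubleword swap is a
display effect.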

The realisation that even and odd columns get switched also explains the
necessary vext adjustments. So it's also not true that vext changes the
end of the vector where it extracts.
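One way to see why the vext immediates need adjusting is to model
vext.32 as a lane rotation and apply it to the column-swapped layout.
This is a sketch, not the actual patch: it assumes vext.32 is used with
both source operands being the same register, and the helper names are
invented for the illustration.

```python
def vext32(lanes, imm):
    """vext.32 q, q, q, #imm with equal sources: lane i takes lane (i + imm) mod 4."""
    return [lanes[(i + imm) % 4] for i in range(4)]

def swap_columns(lanes):
    """Layout after a big-endian vldm: even and odd columns transposed (1:0:3:2)."""
    return [lanes[i ^ 1] for i in range(4)]

logical = ["w0", "w1", "w2", "w3"]   # column order as seen on little-endian
physical = swap_columns(logical)     # lane order after a big-endian vldm

# For each rotation n of the logical row, find the immediate that produces
# the same row (in swapped layout) when applied to the swapped register.
mapping = {}
for n in (1, 2, 3):
    wanted = swap_columns(vext32(logical, n))
    mapping[n] = next(k for k in (1, 2, 3) if vext32(physical, k) == wanted)

print(mapping)
```

In this model the immediates 1 and 3 trade places while 2 stays put.
vext itself behaves identically on both endiannesses; only the lane
layout it operates on differs.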

Regarding umac, it's similar: vld1.8 loads a byte sequence from memory
without any swapping on either le or be. vld1.i32 reads the keys, which
are stored in host endianness, as words from memory. So the
representation ending up in the registers is the same as well, which is
why the code doesn't need any adjustment.

Finally, the register switch for the return value with vmov in umac-nh
stems from the calling convention. AAPCS says:

"Fundamental types larger than 32 bits may be passed as parameters to, or
returned as the result of, function calls. When these types are in core
registers the following rules apply:
* A doubleword sized type is passed in two consecutive registers (e.g.,
r0 and r1, or r2 and r3). The content of the registers is as if the
value had been loaded from memory representation with a single LDM
instruction."

When loading a big-endian doubleword using ldm, the words end up in the
registers with the right values but transposed. Since the calling
convention mandates exactly this, we have to transpose the words upon
function exit as well.
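The AAPCS rule can be illustrated with a short Python model in which
struct stands in for memory and for the LDM instruction. The helper
name ldm_pair is hypothetical and the 64-bit value is arbitrary.

```python
import struct

def ldm_pair(value, big_endian):
    """Store a 64-bit value in memory in the given byte order, then load two
    consecutive words as LDM would: r0 from the lower address, r1 from the
    next one."""
    order = ">" if big_endian else "<"
    mem = struct.pack(order + "Q", value)
    r0, r1 = struct.unpack(order + "2I", mem)
    return r0, r1

value = 0x1122334455667788

le_regs = ldm_pair(value, big_endian=False)
be_regs = ldm_pair(value, big_endian=True)

print([hex(r) for r in le_regs])  # r0 holds the low word
print([hex(r) for r in be_regs])  # r0 holds the high word
```

Both cases yield the same two word values, but on big-endian the high
word sits in r0, so a routine that computes the result with the low
word first has to exchange the two registers before returning.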

Phew.

> Could you add a short note to arm/README with your findings?
> (It's quite some time since I did neon assembly, so I don't recall off
> the top of my head any details on what the various instructions, in
> particular vextr, do).

Done.

> > FAIL: sexp-conv
> > FAIL: nettle-pbkdf2
> > They've been failing all along. Can they be ignored?
> They're not that relevant to your changes, but I'd like to understand
> why they fail. What's the contents of the tools dir in your build tree?
> You haven't done something like switched from building in the source
> tree build to a separate build tree, without a proper cleaning (make
> distclean) in the source tree?

No. But I have been ignoring an annoying build failure due to TeX being
missing. After reconfiguring with --disable-documentation, both the
build and the testsuite succeed. My bad.

> > Weeell, depends on what you consider easier: I haven't found any binary
> > distribution that supports armeb. Yocto and buildroot seem to support it
> > but still require compiling the whole thing.
> Hmm. Sounds more than a bit inconvenient.

The qemu-user chroot route with the linaro cross toolchain isn't too bad
actually:

cd $HOME/gcc-linaro-7.2.1-2017.11-x86_64_armeb-linux-gnueabihf/armeb-linux-gnueabihf/libc
cp /usr/bin/qemu-armeb-static usr/bin
wget https://gmplib.org/download/gmp/gmp-6.1.2.tar.lz
tar -xf gmp-6.1.2.tar.lz
cd gmp-6.1.2
# segfaults in qemu with -march=armv4 default
PATH=$PWD/../../../bin:$PATH CFLAGS="-march=armv7-a" ./configure \
    --host=armeb-linux-gnueabihf --prefix=$PWD/../gmp
PATH=$PWD/../../../bin:$PATH make -j4 install

git clone https://git.lysator.liu.se/nettle/nettle.git
cd nettle
autoreconf
PATH=$PWD/../../../bin:$PATH ./configure --host=armeb-linux-gnueabihf \
    --enable-arm-neon --with-include-path=$PWD/../gmp/include \
    --with-lib-path=$PWD/../gmp/lib
PATH=$PWD/../../../bin:$PATH make -j4
NETTLE_TEST_ROOT=/nettle/testsuite PATH=$PWD/../../../bin:$PATH make -j4 check \
    EMULATOR="sudo QEMU_SET_ENV=LD_LIBRARY_PATH=/nettle/.lib:/gmp/lib chroot $PWD/.."

with this small patch to run-tests:
diff --git a/run-tests b/run-tests
index 3d5655cf..bbc2bb4c 100755
--- a/run-tests
+++ b/run-tests
@@ -37,7 +37,7 @@ find_program () {
          ;;
        *)
          if [ -x "$1" ] ; then
-             echo "./$1"
+             echo "${NETTLE_TEST_ROOT:=.}/$1"
          else
              echo "$srcdir/$1"
          fi


> > Apple does do arm and someone could potentially want to build a fat
> > nettle that supports x86_64 and arm or rather arm and arm64.
> My concern is not breaking any setup which currently works, e.g, a non
> assembly "universal" build involving architectures with different
> endianness.

Right, that should be fine then.

> > Does nettle currently support being compiled fat with assembly at all?
> I don't think so. I'd expect one would have to build for one arch at a
> time, and have some postprocessing scripts to produce apple-fat
> libraries.

Apple have wrapped this in the compiler driver using multiple -arch
arguments. "gcc -arch x86_64 -arch arm" will run the compiler twice on
the same file and lipo the resulting objects together into a fat object.
The linker supports linking those into fat binaries.

If all the assembler implementations of the same routine were in one
file wrapped by #ifdefs, the same could be done there. Otherwise,
assembly and lipoing would have to be done explicitly for those files.

# clang -v -arch x86_64 -arch i386 -c -o t.o t.c
[...]
Apple LLVM version 9.0.0 (clang-900.0.39.2)
Target: i386-apple-darwin17.4.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
 "/Library/Developer/CommandLineTools/usr/bin/clang" -cc1 -triple
x86_64-apple-macosx10.13.0 ...
[...]
 "/Library/Developer/CommandLineTools/usr/bin/clang" -cc1 -triple
i386-apple-macosx10.13.0 ...
[...]
"/Library/Developer/CommandLineTools/usr/bin/lipo" -create -output t.o
/var/folders/ft/dp06pw254ybbzt42f1qn65pm0000gp/T/t-5eeded.o
/var/folders/ft/dp06pw254ybbzt42f1qn65pm0000gp/T/t-b25776.o
# file t.o
t.o: Mach-O universal binary with 2 architectures: [x86_64:Mach-O 64-bit
object x86_64] [i386:Mach-O object i386]
t.o (for architecture x86_64):  Mach-O 64-bit object x86_64
t.o (for architecture i386):    Mach-O object i386

> > But then I want to have a nice error message so as to not leave the user
> > with an aborted build and no apparent reason. :) Is this portable?
> According to
> http://pubs.opengroup.org/onlinepubs/9699919799/utilities/m4.html,
> errprint and m4exit are standard m4. (If they're also supported in
> practice is a different question, it's desirable to at least work with
> both GNU and BSD m4). If __file__ and __line__ are unportable, you could
> omit that. Since the error message reports a pretty global config
> problem, precise location isn't that important.

Not critical, __file__ and __line__ dropped. Net/Free/OpenBSD m4
support them though.

> > The patch got quite large now. Should I better make a series out of it?
> As you prefer, I think it is workable as is. It might help to split out
> the configure-related changes.

Series forthcoming.
-- 
Thanks,
Michael
_______________________________________________
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
