Hello Mamone,

On Mon, Jan 18, 2021 at 06:27:40PM +0200, Maamoun TK wrote:

> It would be nice to get the implementation of the enhanced algorithm
> working for both endian modes as it yields a good performance boost. Also,
> there is no much effort here, the only thing I'm struggling with is to get
> the binary built for Aarch64_be, I'm using Ubuntu on x86_64 as host and it
> seems there is no official package to cross compile for Aarch64_be.

Yes, there are no packages for aarch64_be in any mainstream distribution
I'm aware of. Buildroot and Gentoo are the ones I know that can target
it, Yocto likely as well. All are compile-yourself-distributions and not
for the faint of heart. Also, I've just learned that Buildroot has made
a conscious decision not to produce native toolchains for the target. So
you can only ever cross-compile nettle to it, run it on an actual board
or under qemu and then go back to the cross-compiler on the host.

> > I did a search of the aarch64 instruction set and saw that there's zip1
> > and zip2 instructions. So as a first test I just changed zip to zip1
> > which made it compile. As was to be expected, the testsuite failed
> > though.
> >
> You are on the right track so far.

I've poked at the code a bit more and seemingly made the key init
function work by eliminating all the BE specific macros and instead
adjusting the load from memory to produce the same register content. At
least register values and the final output to memory look the same in
an x/64xb $x0-64 and x/64xb $x0 for the first test cases in gcm-test
(which they did not before).

137         PMUL_PARAM v5,v29,v30
(gdb)
139         st1            {v27.16b,v28.16b,v29.16b,v30.16b},[x0]
(gdb)
141         ret
(gdb) x/64xb $x0-64
0xaaaaaaac5390: 0x77    0x58    0x14    0xdf    0xa9    0x97    0xd2    0xcd
[.. all the same on BE and LE ...]
0xaaaaaaac53c8: 0x0d    0x12    0x63    0x69    0x37    0x20    0xd3    0xfe
(gdb) x/64xb $x0
0xaaaaaaac53d0: 0xf9    0xfa    0x22    0xc3    0x02    0xe7    0x95    0x86
[.. all the same on BE and LE ...]
0xaaaaaaac5408: 0x45    0x91    0xbd    0x48    0x73    0xd9    0x8b    0x5c
(gdb)
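
For comparing such dumps between an LE and a BE run without eyeballing
them, a small helper along these lines might do. It is only a sketch in
Python, not part of nettle: parse_gdb_bytes and first_difference are
made-up names, and it assumes the input is pasted x/<n>xb output like
the above. Addresses are ignored because they may differ between runs.

```python
def parse_gdb_bytes(dump):
    """Collect byte values from gdb `x/<n>xb` output, ignoring addresses."""
    out = bytearray()
    for line in dump.splitlines():
        addr, sep, rest = line.partition(":")
        # Data lines look like "0xaaaaaaac5390: 0x77    0x58 ..."; skip
        # prompts, source lines and anything else without that shape.
        if not sep or not addr.strip().startswith("0x"):
            continue
        out.extend(int(tok, 16) for tok in rest.split())
    return bytes(out)


def first_difference(a, b):
    """Return the index of the first differing byte, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))


# One data line taken from the session above, plus a prompt line to skip.
sample = (
    "(gdb) x/64xb $x0-64\n"
    "0xaaaaaaac5390: 0x77    0x58    0x14    0xdf"
    "    0xa9    0x97    0xd2    0xcd\n"
)
print(parse_gdb_bytes(sample).hex())  # 775814dfa997d2cd
```

Feeding it the LE and the BE dump and calling first_difference on the
two results would point at the first byte offset where they diverge.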

The problem here once more seems to be that after a 128bit LE load which
is later used as two 64bit operands, not only the bytes of the operands
are reversed (which you already counter by rev64'ing them, I gather) but
the operands (doublewords) also end up transposed in the register. This
is something the rest of the routine expects but is only true on LE. So
I adjusted for it on BE in a very pedestrian way:

diff --git a/arm64/v8/gcm-hash.asm b/arm64/v8/gcm-hash.asm
index 1c14db54..74cd656a 100644
--- a/arm64/v8/gcm-hash.asm
+++ b/arm64/v8/gcm-hash.asm
@@ -55,17 +55,10 @@ C common macros:
 .endm

 .macro REDUCTION out
-IF_BE(`
-    pmull          T.1q,F.1d,POLY.1d
-    ext            \out\().16b,F.16b,F.16b,#8
-    eor            R.16b,R.16b,T.16b
-    eor            \out\().16b,\out\().16b,R.16b
-',`
     pmull          T.1q,F.1d,POLY.1d
     eor            R.16b,R.16b,T.16b
     ext            R.16b,R.16b,R.16b,#8
     eor            \out\().16b,F.16b,R.16b
-')
 .endm

     C void gcm_init_key (union gcm_block *table)
@@ -108,19 +101,11 @@ define(`H4M', `v29')
 define(`H4L', `v30')

 .macro PMUL_PARAM in, param1, param2
-IF_BE(`
-    pmull2         Hp.1q,\in\().2d,POLY.2d
-    ext            Hm.16b,\in\().16b,\in\().16b,#8
-    eor            Hm.16b,Hm.16b,Hp.16b
-    zip            \param1\().2d,\in\().2d,Hm.2d
-    zip2           \param2\().2d,\in\().2d,Hm.2d
-',`
     pmull2         Hp.1q,\in\().2d,POLY.2d
     eor            Hm.16b,\in\().16b,Hp.16b
     ext            \param1\().16b,Hm.16b,\in\().16b,#8
     ext            \param2\().16b,\in\().16b,Hm.16b,#8
     ext            \param1\().16b,\param1\().16b,\param1\().16b,#8
-')
 .endm

 PROLOGUE(_nettle_gcm_init_key)
@@ -128,6 +113,10 @@ PROLOGUE(_nettle_gcm_init_key)
     dup            EMSB.16b,H.b[0]
 IF_LE(`
     rev64          H.16b,H.16b
+',`
+    mov            x1,H.d[0]
+    mov            H.d[0],H.d[1]
+    mov            H.d[1],x1
 ')
     mov            x1,#0xC200000000000000
     mov            x2,#1
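
To sanity-check the reasoning behind that swap, here is a toy model in
Python (not nettle code). It treats a vector register as a 128-bit
integer with d[0] in the low 64 bits, and assumes a 128-bit ldr follows
the CPU's endianness for the whole quadword. All function names are
made up for illustration.

```python
MASK64 = (1 << 64) - 1


def ldr_q(mem, big_endian):
    """Model a 128-bit `ldr`: the whole quadword follows CPU endianness."""
    assert len(mem) == 16
    return int.from_bytes(mem, "big" if big_endian else "little")


def lanes(reg):
    """Return (d[0], d[1]); d[0] is the low 64 bits of the register."""
    return reg & MASK64, reg >> 64


def rev64(reg):
    """Model `rev64 v.16b`: reverse the bytes inside each doubleword."""
    def flip(d):
        return int.from_bytes(d.to_bytes(8, "little"), "big")
    d0, d1 = lanes(reg)
    return flip(d0) | (flip(d1) << 64)


def swap_doublewords(reg):
    """Model the three-mov sequence: exchange d[0] and d[1]."""
    d0, d1 = lanes(reg)
    return d1 | (d0 << 64)


mem = bytes(range(16))  # 00 01 02 ... 0f in memory order
le = rev64(ldr_q(mem, big_endian=False))            # current LE path
be = swap_doublewords(ldr_q(mem, big_endian=True))  # BE path in the diff

print("LE d[0]=%016x d[1]=%016x" % lanes(le))
print("BE d[0]=%016x d[1]=%016x" % lanes(be))
# Both lines show d[0]=0001020304050607 d[1]=08090a0b0c0d0e0f
```

In this model the LE path (load, then rev64) and the BE path (load, then
doubleword swap) end up with identical register content, which is what
the rest of the routine appears to rely on.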

If my understanding is correct, we could avoid the doubleword swap for
both LE and BE if we were to load using ld1 to {H.16b} instead (with a
precalculation of the offset because ld1 won't take an immediate offset
that high, correct?). But then the rest of the routine would need to
change its expectation of what H.d[0] and H.d[1] contain, respectively,
because they would no longer be transposed by either the load on LE or
an explicit swap on BE.
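
As a rough cross-check of the ld1 idea, the following Python toy model
(again not nettle code, names made up) assumes that a byte-element ld1
places memory byte i into lane b[i] in both endian modes, and compares
that with a 128-bit ldr, which is assumed to follow CPU endianness.

```python
def ldr_q(mem, big_endian):
    """Model a 128-bit `ldr`: the whole quadword follows CPU endianness."""
    return int.from_bytes(mem, "big" if big_endian else "little")


def ld1_16b(mem, big_endian):
    """Model `ld1 {v.16b},[x]`: memory byte i lands in lane b[i].

    The big_endian argument is deliberately unused: byte elements have
    no internal byte order, so the result is the same in both modes.
    """
    reg = 0
    for i, byte in enumerate(mem):
        reg |= byte << (8 * i)  # lane b[i] occupies bits 8*i .. 8*i+7
    return reg


mem = bytes(range(16))  # 00 01 02 ... 0f in memory order

same_in_both_modes = ld1_16b(mem, True) == ld1_16b(mem, False)
matches_le_ldr = ld1_16b(mem, True) == ldr_q(mem, big_endian=False)
matches_be_ldr = ld1_16b(mem, True) == ldr_q(mem, big_endian=True)

print(same_in_both_modes, matches_le_ldr, matches_be_ldr)
# prints: True True False
```

So in this model ld1 gives one register layout for both modes, and it
coincides with what ldr produces on LE. What that means for the rev64
and for the expectations of the rest of the routine is beyond what the
model can tell.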

Somehow I have a feeling I'm terribly missing the actual point here,
though. Are the zip instructions likely to give even further speedup
beyond the LE version? Could this be exploited for LE as well by
adjusting the loading scheme even more?

Also, it's not fully working yet. Before digging deeper I wanted to give
a bit of an update and get guidance as to how to proceed.

> > podman run -it -v ~/Downloads/nettle:/nettle
> I tried that but I'm having difficulty getting it work, it seems there is a
> problem in my system configuration that prevents podman establishing a
> socket for connection, I spend some time looking for alternative solutions
> with no chance. Do you have any other solutions? all what I can think of is
> either setup ssh connection or work together to get it work if you are into
> it!

I mulled this over from all directions. Access to the actual board is
somewhat complicated by the limits of my available Internet connections
(CGNAT being one, missing DMZ functionality on the routers another). It
can certainly be done, I just would need some time to set it up.

But I have made the cross-compiling and -debugging setup of the
container available on a vserver on the Net. Send me a mail directly
with an SSH ID public key if you'd like to try this out and I'll send
you instructions for login and use. We could meet up there in a
tmux/screen session and work on it together as well.

I have also tried to extract the buildroot toolchain from the image and
run it on my Gentoo box as well as Debian. It even seems relocatable, so
you can just put it anywhere and add it to PATH and it'll work. If you
want, I can put a tarball with the toolchain and qemu wrappers up on a
web server somewhere for you to grab. (I just thought, a container image
would be the easier delivery method nowadays. :)

Otherwise, what's your error message from podman? It's got no daemon, so
it shouldn't need a socket to connect to it like docker does. Out to the
Internet for image download it's also a standard client and respects
environment variables for proxies as usual.

Rootless podman (running as your standard user instead of root) can take a
bit of tweaking before it stops throwing error messages but once that's
done it works nicely. I've never actually run podman as root by luck of
late birth with regards to containers.

Here's my command sequence on an Ubuntu 20.04 VM that's never seen
rootless podman before as per
https://www.vultr.com/docs/how-to-install-and-use-podman-on-ubuntu-20-04
(literally the first hit on search, can't vouch for the packages from
opensuse though):

michael@demo:~$ podman

Command 'podman' not found, did you mean:

  command 'pod2man' from deb perl (5.30.0-9ubuntu0.2)

Try: sudo apt install <deb name>

michael@demo:~$ source /etc/os-release
michael@demo:~$ sudo sh -c "echo 'deb http://download.opensuse.org/repositories/devel:/kubic:/libcontainers:/stable/xUbuntu_${VERSION_ID}/ /' > /etc/apt/sources.list.d/devel:kubic:libcontainers:stable.list"
michael@demo:~$ wget -nv https://download.opensuse.org/repositories/devel:kubic:libcontainers:stable/xUbuntu_${VERSION_ID}/Release.key -O- | sudo apt-key add -
2021-01-19 21:13:19 URL:https://download.opensuse.org/repositories/devel:/kubic:/libcontainers:/stable/xUbuntu_20.04/Release.key [1093/1093] -> "-" [1]
OK
michael@demo:~$ sudo apt-get update -qq
michael@demo:~$ sudo apt-get -qq --yes install podman fuse-overlayfs slirp4netns
[...]
michael@demo:~$ podman run -it michaelweisernettleci/buildroot:2020.11.1-aarch64_be-glibc-gdb
Completed short name "michaelweisernettleci/buildroot" with unqualified-search registries (origin: /etc/containers/registries.conf)
Trying to pull docker.io/michaelweisernettleci/buildroot:2020.11.1-aarch64_be-glibc-gdb...
Getting image source signatures
Copying blob 6c33745f49b4 done
Copying blob ff35d554f2d5 done
Copying blob 3927b287d6b9 done
Copying blob 6bbc022f227c done
Copying config 21663e44fe done
Writing manifest to image destination
Storing signatures
root@06e70f1e12e4:/# aarch64_be-buildroot-linux-gnu-gcc -v
Using built-in specs.
COLLECT_GCC=/buildroot/output/host/bin/aarch64_be-buildroot-linux-gnu-gcc.br_real
COLLECT_LTO_WRAPPER=/buildroot/output/host/bin/../libexec/gcc/aarch64_be-buildroot-linux-gnu/9.3.0/lto-wrapper
Target: aarch64_be-buildroot-linux-gnu
Configured with: ./configure
--prefix=/buildroot/output/per-package/host-gcc-final/host
[...]
--enable-shared --disable-libgomp --silent
Thread model: posix
gcc version 9.3.0 (Buildroot 2020.11.1)
root@06e70f1e12e4:/# git clone https://git.lysator.liu.se/nettle/nettle
bash: git: command not found
root@06e70f1e12e4:/# apt-get update
Get:1 http://security.debian.org/debian-security buster/updates InRelease [65.4 kB]
Get:2 http://deb.debian.org/debian buster InRelease [121 kB]
[...]
root@06e70f1e12e4:/# apt-get install git
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
  ca-certificates git-man krb5-locales less libbsd0 libcurl3-gnutls
[...]
root@06e70f1e12e4:/# git clone https://git.lysator.liu.se/nettle/nettle
Cloning into 'nettle'...
warning: redirecting to https://git.lysator.liu.se/nettle/nettle.git/
remote: Enumerating objects: 721, done.
remote: Counting objects: 100% (721/721), done.
remote: Compressing objects: 100% (349/349), done.
remote: Total 21095 (delta 479), reused 593 (delta 372), pack-reused 20374
Receiving objects: 100% (21095/21095), 5.90 MiB | 3.47 MiB/s, done.
Resolving deltas: 100% (15748/15748), done.
root@06e70f1e12e4:/#

That was a lot easier than even I expected. Necessary stuff like entries
in /etc/subuid are automatically added by useradd as standard nowadays
without podman even being installed:

michael@demo:~$ cat /etc/subuid
michael:100000:65536
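
For reference, each /etc/subuid entry has the form name:start:count. A
tiny Python sketch (parse_subuid is a made-up helper, not something
podman ships) shows which subordinate UID range the line above grants:

```python
def parse_subuid(line):
    """Split one /etc/subuid entry of the form name:start:count."""
    name, start, count = line.strip().split(":")
    start, count = int(start), int(count)
    return name, range(start, start + count)


name, ids = parse_subuid("michael:100000:65536")
print(name, ids.start, ids[-1])  # michael 100000 165535
```

So user michael may map 65536 subordinate UIDs, 100000 through 165535,
into rootless containers.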

Hope that helps.

If all else fails and it's not too trying for your patience I'm up for
making it work iteratively by trial, error and discussion as above. ;)
-- 
Thanks,
Michael
_______________________________________________
nettle-bugs mailing list
[email protected]
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
