Eric Richter <eric...@linux.ibm.com> writes:

> This patch introduces an optimized powerpc64 assembly implementation for
> sha256-compress-n. This takes advantage of the vshasigma instruction, as
> well as unrolling loops to best take advantage of running instructions
> in parallel.

Thanks. I'm now having a closer read of the assembly code. Comments below.

> +C ROUND(A B C D E F G H R EXT)
> +define(`ROUND', `
> +
> +     vadduwm VT1, VK, IV($9)               C VT1: k+W
> +     vadduwm VT4, $8, VT1                  C VT4: H+k+W
> +
> +     lxvw4x  VSR(VK), TK, K                C Load Key
> +     addi    TK, TK, 4                     C Increment Pointer to next key
> +
> +     vadduwm VT2, $4, $8                   C VT2: H+D
> +     vadduwm VT2, VT2, VT1                 C VT2: H+D+k+W
> +
> +     vshasigmaw      SIGE, $5, 1, 0b1111   C Sigma(E)  Se
> +     vshasigmaw      SIGA, $1, 1, 0        C Sigma(A)  Sa
> +
> +     vxor    VT3, $2, $3                   C VT3: b^c
> +     vsel    VT0, $7, $6, $5               C VT0: Ch.
> +     vsel    VT3, $3, $1, VT3              C VT3: Maj(a,b,c)
> +
> +     vadduwm VT4, VT4, VT0                 C VT4: Hkw + Ch.
> +     vadduwm VT3, VT3, VT4                 C VT3: HkW + Ch. + Maj.
> +
> +     vadduwm VT0, VT0, VT2                 C VT0: Ch. + DHKW
> +     vadduwm $8, SIGE, SIGA                C Anext: Se + Sa
> +     vadduwm $4, VT0, SIGE                 C Dnext: Ch. + DHKW + Se
> +     vadduwm $8, $8, VT3                   C Anext: Se+Sa+HkW+Ch.+Maj.
> +
> +
> +     C Schedule (data) for 16th round in future
> +     C Extend W[i]
> +     ifelse(`$10', `1', `
> +             vshasigmaw      SIGE, IV($9 + 14), 0, 0b1111
> +             vshasigmaw      SIGA, IV($9 + 1), 0, 0b0000
> +             vadduwm         IV($9), IV($9), SIGE
> +             vadduwm         IV($9), IV($9), SIGA
> +             vadduwm         IV($9), IV($9), IV($9 + 9)
> +     ')
> +')

I think it would be a bit simpler to split the extend logic out into
its own macro.

> +define(`EXTENDROUND',        `ROUND($1, $2, $3, $4, $5, $6, $7, $8, $9, 1)')

If you do that, then you would define
  
  define(`EXTENDROUND', `ROUND($1, $2, $3, $4, $5, $6, $7, $8, $9) EXTEND($9)')
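
where EXTEND would just be the body that's now under the ifelse, with
the round index as its argument, roughly (a sketch):

  define(`EXTEND', `
	vshasigmaw	SIGE, IV($1 + 14), 0, 0b1111
	vshasigmaw	SIGA, IV($1 + 1), 0, 0b0000
	vadduwm		IV($1), IV($1), SIGE
	vadduwm		IV($1), IV($1), SIGA
	vadduwm		IV($1), IV($1), IV($1 + 9)
  ')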

(In other related code, input expansion is done at the beginning of a
round iteration rather than at the end, but doing it at the end as you
do may give better scheduling.)

> +define(`LOAD', `
> +     IF_BE(`lxvw4x   VSR(IV($1)), 0, INPUT')
> +     IF_LE(`
> +             lxvd2x  VSR(IV($1)), 0, INPUT
> +             vperm   IV($1), IV($1), IV($1), VT0
> +     ')
> +     addi    INPUT, INPUT, 4
> +')
> +
> +define(`DOLOADS', `
> +     IF_LE(`DATA_LOAD_VEC(VT0, .load_swap, T1)')

Could you have a dedicated register for the permutation constant, and
load it only once at function entry? If you have general registers to
spare, it could also make sense to use, e.g., three registers for the
constant values 16, 32, 48, and use them for indexing. Then you don't
need to update the INPUT pointer as often, and you can use the same
constants for other load/store sequences as well.
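
For instance, a rough sketch (TC16, TC32, TC48 and VSWAP are made-up
names, and I'm assuming DOLOADS fills IV(0) through IV(15) with the
4-byte stride of LOAD, i.e. IV(i) holding W[i..i+3]):

  C At function entry, once (hypothetical register names):
  li      TC16, 16
  li      TC32, 32
  li      TC48, 48
  IF_LE(`DATA_LOAD_VEC(VSWAP, .load_swap, T1)')

  C Then, e.g., four loads per update of INPUT:
  lxvd2x  VSR(IV(0)),  0,    INPUT
  lxvd2x  VSR(IV(4)),  TC16, INPUT
  lxvd2x  VSR(IV(8)),  TC32, INPUT
  lxvd2x  VSR(IV(12)), TC48, INPUT
  addi    INPUT, INPUT, 4
  ...

with the IF_LE vperm fixups applied via VSWAP afterwards (and lxvw4x,
no permute, on big-endian, as in LOAD).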

> +     LOAD(0)
> +     LOAD(1)
> +     LOAD(2)
> +     LOAD(3)

> +PROLOGUE(_nettle_sha256_compress_n)
> +     cmpwi   0, NUMBLOCKS, 0
> +     ble     0, .done
> +     mtctr   NUMBLOCKS
> +
> +     C Store non-volatile registers
> +     subi    SP, SP, 64+(12*16)
> +     std     T0,     24(SP)
> +     std     T1,     16(SP)
> +     std     COUNT,  8(SP)

For save/restore of registers, I prefer to use the register names, not
the defined symbols. And T0, T1, COUNT are defined to use r7, r8, r10,
which *are* volatile, right?

Does the data stored fit in the 288 byte "protected zone"? If so,
probably best to not modify the stack pointer.

> +     li      T0, 32
> +     stvx    v20, 0, SP
> +     subi    T0, T0, 16
> +     stvx    v21, T0, SP

Here it would also help a bit to allocate constants 16, 32, 48 in registers.
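E.g., tying this together with the protected zone point above (a
sketch; TC16/TC32/TC48 are the same made-up offset registers, and the
12 vector registers need only 192 bytes, well within 288):

  addi    T0, SP, -192
  stvx    v20, 0,    T0
  stvx    v21, TC16, T0
  stvx    v22, TC32, T0
  stvx    v23, TC48, T0
  addi    T0, T0, 64
  stvx    v24, 0,    T0
  stvx    v25, TC16, T0
  stvx    v26, TC32, T0
  stvx    v27, TC48, T0
  addi    T0, T0, 64
  stvx    v28, 0,    T0
  stvx    v29, TC16, T0
  stvx    v30, TC32, T0
  stvx    v31, TC48, T0

with the mirror image (lvx) for the restore at function exit.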

> +     subi    T0, T0, 16
> +     stvx    v22, T0, SP
> +     subi    T0, T0, 16
> +     stvx    v23, T0, SP
> +     subi    T0, T0, 16
> +     stvx    v24, T0, SP
> +     subi    T0, T0, 16
> +     stvx    v25, T0, SP
> +     subi    T0, T0, 16
> +     stvx    v26, T0, SP
> +     subi    T0, T0, 16
> +     stvx    v27, T0, SP
> +     subi    T0, T0, 16
> +     stvx    v28, T0, SP
> +     subi    T0, T0, 16
> +     stvx    v29, T0, SP
> +     subi    T0, T0, 16
> +     stvx    v30, T0, SP
> +     subi    T0, T0, 16
> +     stvx    v31, T0, SP
> +
> +     C Load state values
> +     li      T0, 16
> +     lxvw4x  VSR(VSA), 0, STATE      C VSA contains A,B,C,D
> +     lxvw4x  VSR(VSE), T0, STATE     C VSE contains E,F,G,H
> +
> +.loop:
> +     li      TK, 0
> +     lxvw4x  VSR(VK), TK, K
> +     addi    TK, TK, 4
> +
> +     DOLOADS
> +
> +     C "permute" state from VSA containing A,B,C,D into VSA,VSB,VSC,VSD

Can you give a bit more detail on this permutation? Do the main round
operations only use 32 bits each from the state registers? Is there no
reasonable way to use a more compact representation?

> +     vsldoi  VSB, VSA, VSA, 4
> +     vsldoi  VSF, VSE, VSE, 4
> +
> +     vsldoi  VSC, VSA, VSA, 8
> +     vsldoi  VSG, VSE, VSE, 8
> +
> +     vsldoi  VSD, VSA, VSA, 12
> +     vsldoi  VSH, VSE, VSE, 12
> +
> +     EXTENDROUNDS
> +     EXTENDROUNDS
> +     EXTENDROUNDS
> +     NOEXTENDROUNDS
> +
> +     C Reload initial state from stack
> +     li      T0, 16
> +     lxvw4x  VSR(VT0), 0, STATE      C VSA contains A,B,C,D
> +     lxvw4x  VSR(VT1), T0, STATE     C VSE contains E,F,G,H
> +
> +     C Repack VSA,VSB,VSC,VSD into VSA,VSE for storing
> +     vmrghw  VSA, VSA, VSB
> +     vmrghw  VSC, VSC, VSD
> +     vmrghw  VSE, VSE, VSF
> +     vmrghw  VSG, VSG, VSH
> +
> +     xxmrghd VSR(VSA), VSR(VSA), VSR(VSC)
> +     xxmrghd VSR(VSE), VSR(VSE), VSR(VSG)
> +
> +     vadduwm VSA, VSA, VT0
> +     vadduwm VSE, VSE, VT1

It seems unfortunate to have to do this conversion for each iteration of
the loop; it would be nice if the state could be converted to the most
efficient form before entering the loop, and not converted back until
after loop exit. But we probably don't have enough registers to keep the
old state exploded into many registers. And load/store of exploded state
doesn't seem that attractive either.

> +     li      T0, 16
> +     stxvw4x VSR(VSA), 0, STATE
> +     stxvw4x VSR(VSE), T0, STATE
> +
> +     bdnz    .loop

Regards,
/Niels Möller

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.