Cryptography-Digest Digest #793, Volume #8 Wed, 23 Dec 98 18:13:02 EST
Contents:
AES candidate performance on the Alpha 21164 processor (version 1) (Kenneth Almquist)
----------------------------------------------------------------------------
From: [EMAIL PROTECTED] (Kenneth Almquist)
Subject: AES candidate performance on the Alpha 21164 processor (version 1)
Date: 23 Dec 1998 21:40:58 GMT
The Alpha 21164 processor is an interesting platform for comparison
of AES candidates. It is one of the most advanced microprocessors
available today, and thus provides some indication of what midrange
processors will look like a few years from now. In particular, it
is a 64 bit processor. Since Intel is moving to a 64 bit architecture
(Merced), it seems safe to assume that within a few years, 64 bit
processors will be the norm for personal computers. It would be
even more interesting to know how the candidates will perform on
the Merced, but Intel hasn't released the specifications of the
Merced yet.
1. The Numbers
I've computed some timing numbers for the Alpha 21164 processor, using
the documentation available from the digital.com web site. To get
these numbers, I coded the body of a single round, multiplied by the
number of rounds, and added the extra cost of whitening (if used).
algorithm     total   rounds   cycles/round   extra
---------------------------------------------------
DFC            304       8        38             0
Rijndael-128   340      10        34             0
Twofish        368      16        22.5           8
HPC            376?      8        47             0
Crypton        408      12        34             0
Rijndael-192   408      12        34             0
RC6            467      20        23             7
Rijndael-256   476      14        34             0
CAST-256       600      48        12.5           0
Serpent        947?     31        30            17
(total = rounds * cycles/round + extra)
These numbers are optimistic because they don't include overhead
such as loading the data to be encrypted into registers. They also
don't include cache misses. (The data cache on the 21164 contains
8K bytes, and all the algorithms require less.) The three entries
for Rijndael are for Rijndael with different key sizes.
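Each total in the table is simply the number of rounds times the cycles per round, plus the extra whitening cost. A quick check of the arithmetic (illustrative only):

```python
# (total, rounds, cycles_per_round, extra) as listed in the table
table = {
    "DFC":          (304,  8, 38,   0),
    "Rijndael-128": (340, 10, 34,   0),
    "Twofish":      (368, 16, 22.5, 8),
    "RC6":          (467, 20, 23,   7),
    "CAST-256":     (600, 48, 12.5, 0),
    "Serpent":      (947, 31, 30,  17),
}
for name, (total, rounds, per_round, extra) in table.items():
    assert total == rounds * per_round + extra, name
```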
The designers of two of the algorithms have provided numbers for
implementations on the Alpha:
algorithm cycles
---------------------
DFC: 428
HPC: 421
In both cases, these numbers are significantly higher than mine. I
would expect them to be larger because my numbers are optimistic, but
the differences are large enough that I assume the reported numbers
are either for the Alpha 21064 processor (which is slower than the
21164 processor) or are due to less efficient coding.
I apologize if I have omitted your favorite algorithm. I gave
preference to algorithms which are freely usable, seem to have a
chance of being selected as the AES standard, and/or can use the
features of a 64 bit architecture.
2. Comments on the Algorithms
The Alpha 21164 can execute two integer instructions in parallel.
Crypton, Rijndael, Twofish, and (probably) Serpent all have sufficient
parallelism to keep the processor occupied. DFC, HPC, and RC6 do not.
Here are my comments on the individual algorithms:
Decorrelated Fast Cipher (DFC), which is relatively slow on 32 bit
machines, lives up to its name when run on the Alpha. The round
function is conceptually simple, but includes a 64 bit multiply and
a modulo operation with the modulus 2**64+13. These operations
are slow, even on the Alpha. DFC comes out on top because the
number of rounds is small. (Note that DFC is provably secure
against many attacks, so it is not necessarily less secure than
competitors with more rounds.)
Twofish has a relatively complex round function, but it has enough
parallelism that two instructions can be executed every cycle,
leading to the smallest number of cycles per round.
Rijndael and Crypton have the same structure, differing only in
the tables used. Each round modifies the entire block, so the
number of table lookups per round is twice that of Twofish.
Otherwise, the round function is quite simple. The round
function has enough parallelism that two instructions can be
executed every cycle. With 128 bit keys, Rijndael uses 10 rounds,
and comes out slightly ahead of Twofish. Crypton uses 12 rounds,
and Rijndael uses 12 and 14 rounds with 192 and 256 bit keys,
respectively, so these are slower than Twofish.
The 128 bit block version of Hasty Pudding Cipher (HPC) has only eight
rounds, but each round can be thought of as 8 Feistel rounds, so the
cipher can also be viewed as consisting of 64 quick Feistel rounds.
The round function is such a mind-numbing mixture of operations
(shifts, adds, subtracts, xors, and table lookups) that I've put a
question mark next to my cycle count. The main problem with the round
function is that it has little parallelism. My guess is that the
cipher was developed for the Alpha 21064 chip, which I believe could
execute only one instruction per cycle. With HPC, the second pipeline
on the 21164 processor is mostly idle. The 512 bit block variant of
HPC appears to have more parallelism, and thus may be a better match
for the 21164 processor.
(As an aside, I got a chuckle out of the HPC AES submission. Commenting
on the difficulty of implementing HPC on smart cards, which generally
contain 128 bytes (at most 256 bytes) of RAM, the submission deadpans:
"The minimum memory requirement of 2KB may be a challenge".)
RC6 uses 32 bit multiplication, and has the fastest encryption speed
of all the AES candidates when run on the Pentium Pro, which has a
fast hardware multiplier. The 21164 also has a fast 32 bit
multiplier. However, RC6 also performs four rotates per round (two of
them variable), and these are not particularly efficient on 64 bit
machines. The round function would be two instructions (one cycle)
shorter if the variable rotations rotated right rather than left. The
same difference in instruction count is likely to occur on a "typical"
64 bit machine. There seems to be no cryptographic reason for RC6's
variable rotations to go left rather than right, so the choice of
rotation direction appears to indicate that performance on 64 bit
machines was not a high priority for the RC6 designers.
CAST-256 has the fastest round function, but each round only modifies
1/4 of the block, so it takes four rounds, or 50 cycles, to modify
every bit in a block. This is slower than any of the other ciphers
except DFC. CAST-256 has a reasonable amount of parallelism. It runs
about 8.7% slower than it would if it could be scheduled to execute
two instructions every cycle.
Serpent is by far the slowest of the algorithms considered here. The
value given in the table is an estimate because I have not worked out
the implementations of the S-boxes. My guess is that the estimate is
on the high side, with the cost per round being closer to 29.5 than to
30. The most time consuming part of the cipher is the linear mixing,
which includes four 32 bit rotates.
3. The Implementations
In this section, I provide the implementations used to calculate the
performance numbers given in section 1. This section is rather long,
so many readers will want to stop here, but it seems only fair to
include the basis for my comparison so that others can check my work
if they desire.
Rather than giving Alpha instructions, I present the implementations
in pseudocode, where each pseudocode statement corresponds to one
Alpha instruction. If two instructions can be executed in one cycle, I
put them on the same line, so each line can be executed in a single
cycle. Idle processor cycles are indicated by lines containing the
word "wait."
When a cipher uses round keys, I assume that a pointer named rkp
points to the round key array. I give the subscripts for the first
iteration of the loop. If it doesn't affect performance, I adjust
rkp to point to the next set of round keys. Otherwise, subsequent
iterations use different subscript values. (That is not a problem
because I assume that the loops are fully unrolled, meaning that
the constants can be different in each iteration).
Most Alpha instructions can only take 0 through 255 as the value of a
constant operand. The exception is in address calculation, which adds
a 16 bit offset to the value of a register. Other constants used in
the code have to be preloaded into registers. In the pseudocode, I
simply use constants without indicating whether they are encoded in
the instruction or taken from registers, since the execution time is
the same either way (except for the cost of loading the registers
prior to encryption, which is small).
3.1. RC6
The 32 bit rotates are computed by concatenating the 32 bit value
with itself to form a 64 bit value, and then shifting right. A rotate
right of X by N bits is
(X << 32 | X & 0xFFFFFFFF) >> N
The rotate operator used in the RC6 specification rotates left, and
masks the shift count, so it is implemented as
(X << 32 | X & 0xFFFFFFFF) >> (32 - (N & 0x1F))
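The concatenate-and-shift identity can be checked against a direct rotate-left in Python (a quick sanity check, not Alpha code):

```python
def rc6_rotl(x, n):
    """32 bit rotate left via concatenation and a single right shift."""
    n &= 0x1F
    return ((x << 32 | x) >> (32 - n)) & 0xFFFFFFFF

def rotl_reference(x, n):
    """Direct definition of 32 bit rotate left."""
    n &= 0x1F
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF if n else x

for x in (0x00000001, 0x80000001, 0xDEADBEEF):
    for n in range(64):          # shift counts wrap at 32, as in RC6
        assert rc6_rotl(x, n) == rotl_reference(x, n)
```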
The 32 bit multiply operation takes 8 cycles, and the second 32 bit
multiply must start 4 cycles after the first. Also, the operands
to the multiply instruction must be calculated two cycles in advance.
Because one multiply instruction must start later than the other, I
calculate
B * (2 * B + 1)
as
B * B * 2 + B
(Credit goes to Doug Whiting for this optimization.) Otherwise, the
code is straightforward.
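The rescheduled form computes the same 32 bit result as the direct form, since the identity B * (2B + 1) = B*B*2 + B also holds modulo 2**32. A quick check (illustrative only):

```python
MASK32 = 0xFFFFFFFF

def f_direct(b):
    # B * (2B + 1), truncated to 32 bits as the mull instruction would
    return (b * (2 * b + 1)) & MASK32

def f_rescheduled(b):
    # B*B first, so the second multiply can be issued earlier
    return (b * b * 2 + b) & MASK32

for b in (0, 1, 0xDEADBEEF, MASK32):
    assert f_direct(b) == f_rescheduled(b)
```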
The instruction schedule below executes at the rate of one cycle
per line. It consists of 26 lines, but the last three lines can
overlap the first three, so in an unrolled loop each round takes
23 cycles.
r2 := D + D
r2 := r2 + 1
r4 := B * B
r0 := rkp[0]; r1 := rkp[1]
rkp := rkp + 8
wait
r5 := D * r2
wait
wait
wait
r4 := r4 + r4
r4 := r4 + B
r2 := r4 << 32; r3 := r4 & 0xFFFFFFFF
r4 := r2 | r3
r2 := r5 << 32; r3 := r5 & 0xFFFFFFFF
r5 := r2 | r3; t := r3 >> 27
u := r4 >> 27; A := A xor t
r8 := u & 0x1F; r7 := t & 0x1F
r10 := A << 32; r9 := A & 0xFFFFFFFF
A := r9 | r10; r8 := 32 - r8
A := A >> r8; C := C xor r4
r7 := 32 - r7; A := A + r0 -- r0 contains S[2*i]
r11 := C << 32; r12 := C & 0xFFFFFFFF
C := r11 | r12
C := C >> r7
C := C + r1 -- r1 contains S[2*i+1]
The initial whitening operation is implemented as
r0 := rkp[0]; r1 := rkp[1]
rkp := rkp + 8
B := B xor r0; D := D xor r1
The final whitening operation is
C := r11 | r12; r2 := rkp[2]
C := C >> r7; r3 := rkp[3]
C := C + r1; A := A xor r2
C := C xor r3
This includes the last operations of the final round, which would
otherwise not be accounted for.
3.2. DFC
We must perform two multiply operations because an Alpha instruction
can only produce a 64 bit output, and we want the 128 bit product.
The Alpha instruction to produce the high order bits takes 14 clock
cycles, while the one to produce the low order bits takes only 12
cycles, so we perform them in that order. The second multiply
operation must start 8 cycles after the first.
The first round key is needed right at the start of the iteration,
so we prefetch it into r0 during the preceding round. (This means
that before the first round, we preload r0 with the first round key.)
The modulo 2**64+13 operation first uses the reduction suggested in
the DFC submission:
(Q * 2**64 + R) modulo 2**64+13
= (13 * ((2**64 - 1) - Q) + 182 + R) modulo 2**64+13
This gives us a value less than 15 * 2**64, which we then reduce using
(Q' * 2**64 + R') modulo 2**64+13 = (R' - 13 * Q') modulo 2**64+13
Since the Alpha doesn't have a carry bit, we split numbers into 32
bit chunks before performing most operations. The exceptions are the
multiply operation and the computation of ((2**64 - 1) - Q). The
latter computation cannot overflow, and is performed by taking the
bitwise complement of Q (which I write as ~Q).
Multiplication of a number X by 13 is done using the formula
X * 13 = ((X * 4) + ((X * 8) + X))
which can be evaluated in two instructions on the Alpha.
When performing the second reduction, where I subtract 13*Q' from
R', the result could be negative, either because I'm working with
only the low 32 bits of R', and I need to do a borrow operation,
or because R' is actually less than 13*Q'. My understanding is
that the low 32 bits of R' should behave like a uniformly distributed
random number. This being the case, and since Q' <= 14, the
probability of the subtraction yielding a negative value is so
small that I ignore the cost of handling this condition in my
cycle count, although I do of course include the cost of testing
for it. The code to handle this case appears at the end of the
section.
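The two reduction steps can be checked directly in Python. This is a sketch of the arithmetic only, ignoring the 32 bit chunking used in the actual schedule; the negative case corresponds to the rarely-taken path at label L3 below.

```python
import random

M = 2**64 + 13

def dfc_reduce(q, r):
    """Reduce q * 2**64 + r modulo 2**64 + 13 using the two-step method."""
    # First reduction: result is less than 15 * 2**64
    t = 13 * ((2**64 - 1) - q) + 182 + r
    # Second reduction: split into Q' and R', then compute R' - 13 * Q'
    q2, r2 = t >> 64, t & (2**64 - 1)
    t = r2 - 13 * q2
    if t < 0:           # rare case, handled out of line in the schedule
        t += M
    return t

random.seed(1)
for _ in range(1000):
    q, r = random.getrandbits(64), random.getrandbits(64)
    assert dfc_reduce(q, r) == (q * 2**64 + r) % M
```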
The round table (RT) is specified to be a table of 32 bit values,
but it is xored with the high order bits, so we implement it as
a table of 64 bit values, with the left shift 32 operations
precalculated.
r1 := rkp[1]
wait
r2 := high_bits(A * r0)
wait
wait
wait
wait
wait
wait
wait
r3 := A * r0; rkp := rkp + 16
r0 := rkp[0]
r14 := r1 >> 32; r15 := r1 & 0xFFFFFFFF
wait
wait
wait
r2 := ~r2
r4 := r2 >> 32; r5 := r2 & 0xFFFFFFFF
r6 := (r4 << 3) + r4; r7 := (r5 << 3) + r5
r6 := (r4 << 2) + r6; r7 := (r5 << 2) + r7
wait
wait
r8 := r3 >> 32; r9 := r3 & 0xFFFFFFFF
r8 := r8 + r14; r9 := r9 + r15
r6 := r6 + r8; r7 := r7 + r9
r7 := r7 + 182
r10 := r7 >> 32; r7 := r7 & 0xFFFFFFFF
r6 := r6 + r10
r10 := r6 >> 32; r6 := r6 & 0xFFFFFFFF
r11 := (r10 << 3) + r10; r12 := r6 & 0x3F
r11 := (r10 << 2) + r11; r12 := (r12 << 3) + RT
r7 := r7 - r11
if r7 < 0 goto L3; r13 := r7 << 32
L1: r12 := load(r12); r13 := r13 | r6
r13 := r13 xor KC
r13 := r13 xor r12
r13 := r13 + KD
B := B xor r13
-- iteration ends here --
L2: r12 := r6 & 0x3F; r13 := r7 << 32
r12 := (r12 << 3) + RT; goto L1
L3: r7 := r7 + (1 << 32); r6 := r6 - 1
if r6 >= 0 goto L2
r7 := r7 + 13
if r7 <= 0xFFFFFFFF goto L2
r7 := r7 & 0xFFFFFFFF
r6 := r6 + 1
goto L2
3.3. HPC
HPC is straightforward but tedious to schedule for the Alpha. The value
of i is the round number. I have coded this assuming that the loop
is fully unrolled, meaning that i is a constant. Making i a variable
would increase the number of instructions, but not the execution time
since there are lots of cycles where currently only one instruction is
executing.
r0 := s0 & 0xFF
r0 := (r0 << 3) + KX
k := load(r0)
r2 := spice[i xor 4]
s1 := s1 + k; r3 := k << 8
s0 := s0 xor r3; r4 := s1 >> 11
s0 := s0 - r4; r5 := s1 << 2
s0 := s0 xor r5
s0 := s0 - r2
r6 := s0 << 32
r6 := r6 xor (PI19 + 128)
s0 := s0 + r6
r7 := s0 >> 17
s0 := s0 + r7
r8 := s0 >> 34; t := spice[i]
s0 := s0 + r8
s0 := s0 xor t; r10 := t << 5
s0 := s0 + r10; t := t >> 4
s1 := s1 + t; s0 := s0 xor t
r12 := s0 & 0x1F
r12 := r12 + 22
r13 := s0 << r12
s0 := s0 + r13
r14 := s0 >> 23; r15 := spice[i xor 7]
s0 := s0 xor r14
s0 := s0 - r15
t := s0 & 0xFF
t := (t << 3) + KX
k := load(t); t := t + 8 * (3*i + 1)
kk := load(t)
s1 := s1 xor k
r18 := kk >> 8; kk := kk xor k
s0 := s0 xor r18; r19 := kk >> 5
s1 := s1 + r19; r20 := kk << 12
s0 := s0 - r20; r21 := kk & ~0xFF
s0 := s0 xor r21
s1 := s1 + s0
r22 := s1 << 3; r23 := spice[i xor 2]
s0 := s0 + r22
s0 := s0 xor r23; r24 := KX[i + 128 + 16]
s0 := s0 + r24
r25 := s0 << 22; r27 := spice[i xor 1]
s0 := s0 + r25; r26 := s1 >> 4
s0 := s0 xor r26
s0 := s0 + r27
r29 := s0 >> (i + 33)
s0 := s0 xor r29
3.4. Twofish
Because Twofish has plenty of potential parallelism, what I give here
is simply a collection of instructions. Scheduling these to execute
at a rate of two per clock cycle is straightforward. I list them this
way so I can group related instructions together.
The data to be encrypted is stored in four registers called x0 through
x3. The only tricky part of this code is the rotate operations. The
rotate of x1 by 8 bits is performed as part of the operation where we
split x1 into 8 bit quantities to feed into the s-boxes. The rotates
by one bit are implemented as
rotate_left(X, 1) = (X << 1) - (X >> 31)
The shift of 31 arranges for us to subtract -1 (i.e. add positive 1)
if bit 31 of X is 1. For this to work on a 64 bit machine, the high
order bits of X must contain the sign extension of bit 31 of X. The
Alpha has 32 bit add and subtract operations, which calculate the
result, truncate to 32 bits, and then sign extend the truncated result
to 64 bits. All addition and subtraction operations in the Twofish
code should be interpreted as these 32 bit operations. Without this
special feature of the Alpha architecture, we would expect each of the
rotate operations to take an additional instruction, adding two to the
instruction count and one to the number of cycles per round.
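The rotate-by-one trick can be modeled in Python by sign-extending the 32 bit value into an unbounded integer (standing in for a 64 bit Alpha register) and truncating the result as the 32 bit addl/subl instructions would. This is my own model of the behavior, not Alpha code:

```python
def rotl1(x):
    """Rotate a 32 bit value left by 1 using (X << 1) - (X >> 31)."""
    # Sign-extend x, modeling a 32 bit value held in a 64 bit register
    sx = x - (1 << 32) if x & 0x80000000 else x
    t = (sx << 1) - (sx >> 31)   # >> is arithmetic on Python ints
    return t & 0xFFFFFFFF        # 32 bit truncation, as addl/subl do

assert rotl1(0x80000000) == 1
assert rotl1(0x40000000) == 0x80000000
assert rotl1(0xFFFFFFFF) == 0xFFFFFFFF
```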
I've named the tables sb0 through sb3, and stored them in a structure
pointed to by expkey.
r0 := x0 & 0xFF
r0 := (r0 << 2) + expkey
r0 := load(r0 + offset_of(sb0))
r1 := x0 >> 8 & 0xFF
r1 := (r1 << 2) + expkey
r1 := load(r1 + offset_of(sb1))
r2 := x0 >> 16 & 0xFF
r2 := (r2 << 2) + expkey
r2 := load(r2 + offset_of(sb2))
r3 := x0 >> 24 & 0xFF
r3 := (r3 << 2) + expkey
r3 := load(r3 + offset_of(sb3))
r4 := r0 xor r1
r5 := r2 xor r3
t0 := r4 xor r5
r6 := x1 >> 8 & 0xFF
r6 := (r6 << 2) + expkey
r6 := load(r6 + offset_of(sb0))
r7 := x1 >> 16 & 0xFF
r7 := (r7 << 2) + expkey
r7 := load(r7 + offset_of(sb1))
r8 := x1 >> 24 & 0xFF
r8 := (r8 << 2) + expkey
r8 := load(r8 + offset_of(sb2))
r9 := x1 & 0xFF
r9 := (r9 << 2) + expkey
r9 := load(r9 + offset_of(sb3))
r10 := r6 xor r7
r11 := r8 xor r9
t1 := r10 xor r11
r12 := x3 << 1
r13 := x3 >> 31
x3 := r12 - r13
r14 := rkp[0]
r15 := t0 + t1
r15 := r15 + r14
x2 := x2 xor r15
r16 := rkp[1]
r17 := t1 + t1
r17 := r17 + t0
r17 := r17 + r16
x3 := x3 xor r17
r18 := x2 << 1
r19 := x2 >> 31
x2 := r18 - r19
3.5. Rijndael and Crypton
As with Twofish, each table lookup takes three instructions:
r0 := x0 & 0xFF
r0 := (r0 << 2) + expkey
r0 := load(r0 + offset_of(sb0))
I won't bother to write out all 48 instructions used to implement the
16 table lookups. After placing the table lookup results into r0
through r15, the remaining instructions are:
x0 := rkp[0]
x0 := x0 xor r0
x0 := x0 xor r1
x0 := x0 xor r2
x0 := x0 xor r3
x1 := rkp[1]
x1 := x1 xor r4
x1 := x1 xor r5
x1 := x1 xor r6
x1 := x1 xor r7
x2 := rkp[2]
x2 := x2 xor r8
x2 := x2 xor r9
x2 := x2 xor r10
x2 := x2 xor r11
x3 := rkp[3]
x3 := x3 xor r12
x3 := x3 xor r13
x3 := x3 xor r14
x3 := x3 xor r15
These instructions are easily scheduled to execute at the rate of two
per cycle on the 21164.
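As the instruction sequence above shows, each output word is the xor of a round key word with four table entries selected by state bytes. A generic sketch of that pattern (illustrative only; t0..t3 are hypothetical placeholders, and which state byte indexes which table is determined by the cipher's diffusion layer, not shown here):

```python
def round_word(rk, b0, b1, b2, b3, t0, t1, t2, t3):
    """One output word: a round key word xored with four 32 bit table entries."""
    return rk ^ t0[b0] ^ t1[b1] ^ t2[b2] ^ t3[b3]

# With identity tables, the result is just the xor of the index bytes
ident = list(range(256))
assert round_word(0, 1, 2, 3, 4, ident, ident, ident, ident) == 1 ^ 2 ^ 3 ^ 4
```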
3.6. Serpent
The execution time for Serpent on the 21164 is an estimate because I
haven't worked out the S-boxes. Other than that, the implementation
is straightforward. The linear mixing phase of Serpent uses 32 bit
rotate operations, which are implemented as
X <<< N = (X << 32 | X & 0xFFFFFFFF) >> (32 - N)
The data to be encrypted is stored as 32 bit quantities in four
registers named A, B, C, and D.
The xor of the data with the round key is implemented as follows:
r0 := rkp[0]; r1 := rkp[1]
r2 := rkp[2]; r3 := rkp[3]
A := A xor r0; B := B xor r1
C := C xor r2; D := D xor r3
The Serpent s-box operations are conceptually simple, but a fast
implementation has to convert each of the S-boxes into
boolean operations. This takes an average of about 17 instructions,
which have a fair amount of parallelism. This takes 8.5 cycles
assuming that two instructions can be executed every cycle. I will
round up (to 9 cycles), since that will give the right answer even
if one of the instructions cannot be paired.
The linear mixing operation involves a bunch of rotate operations,
which we implement using the technique described in the implementation
of RC6.
r0 := A << 32; r1 := A & 0xFFFFFFFF
r2 := r0 | r1; r3 := C << 32
A := r2 >> (32 - 13); r4 := C & 0xFFFFFFFF
r5 := r3 | r4; r6 := A << 3
C := r5 >> (32 - 3); D := D xor r6
D := D xor C; B := B xor A
r7 := D << 32; r8 := D & 0xFFFFFFFF
r9 := r7 | r8; B := B xor C
r10 := B << 32; r11 := B & 0xFFFFFFFF
r12 := r10 | r11; D := r9 >> (32 - 7)
B := r12 >> (32 - 1); A := A xor D
A := A xor B; r13 := B << 7
C := C xor D; r14 := A << 32
C := C xor r13; r15 := A & 0xFFFFFFFF
r17 := C << 32; r18 := C & 0xFFFFFFFF
r16 := r14 | r15; r19 := r17 | r18
A := r16 >> (32 - 5)
C := r19 >> (32 - 22)
This is 34 instructions, which I will count as 17 cycles. (The
latency is 18 cycles, but we can execute two load instructions
from the "xor with key" operation during the last two cycles.)
The time per round is therefore
4 cycles for xor with key
9 cycles for sbox operations (estimated)
17 cycles for mixing operations
30 cycles total (estimated)
The last round replaces the mixing operations with another xor with
key, giving a time of 17 cycles for the last round. In the table I
list 31 regular rounds and an additional time of 17 clocks to account
for the last round.
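For reference, the linear mixing step that the schedule above implements can be written out directly. This sketch follows the rotation and shift amounts given in the Serpent specification:

```python
def rotl(x, n):
    """32 bit rotate left."""
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def serpent_lt(a, b, c, d):
    """Serpent linear transformation on the four 32 bit state words."""
    a = rotl(a, 13)
    c = rotl(c, 3)
    b = b ^ a ^ c
    d = d ^ c ^ ((a << 3) & 0xFFFFFFFF)
    b = rotl(b, 1)
    d = rotl(d, 7)
    a = a ^ b ^ d
    c = c ^ d ^ ((b << 7) & 0xFFFFFFFF)
    a = rotl(a, 5)
    c = rotl(c, 22)
    return a, b, c, d

# The transform is linear over GF(2), so the zero state maps to itself
assert serpent_lt(0, 0, 0, 0) == (0, 0, 0, 0)
```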
3.7. CAST-256
The specification uses a 32 bit rotate left operation. Since the
amount of rotation is controlled by the key, we assume that the round
key array has been preprocessed to convert the rotate left amounts into
rotate right amounts. We implement rotate right as
rotate_right(X, N) = (X << 32 | X & 0xFFFFFFFF) >> N
The round function needs one of its two round keys right at the start,
so we preload that key into r0 as part of the preceding round.
In the reverse quad-rounds, a single round can be executed with the
following instructions. Since three rounds can be executed in
parallel, these operations can be fully parallelized, resulting in
a time per quad-round of 46 cycles.
r1 := rkp[1]
r2 := A + r0
r0 := rkp[2]
r3 := r2 << 32
r4 := r2 & 0xFFFFFFFF
r5 := r3 | r4
r5 := r5 >> r1
r6 := r5 >> 24 & 0xFF
r6 := r6 << 2 + S1
r6 := load(r6)
r7 := r5 >> 16 & 0xFF
r7 := r7 << 2 + S2
r7 := load(r7)
r8 := r5 >> 8 & 0xFF
r8 := r8 << 2 + S3
r8 := load(r8)
r9 := r5 & 0xFF
r9 := r9 << 2 + S4
r9 := load(r9)
r10 := r6 xor r7
r10 := r10 - r8
r10 := r10 + r9
D := D xor r10
In the forward quad-rounds, the rounds must be performed serially. The
execution schedule for a forward quad-round is:
r2 := D + r0; r1 := rkp[1]
r3 := r2 << 32; r4 := r2 & 0xFFFFFFFF
r5 := r3 | r4; r0 := rkp[2]
r5 := r5 >> r1
r6 := r5 >> 24 & 0xFF; r9 := r5 & 0xFF
r6 := r6 << 2 + S1; r7 := r5 >> 16 & 0xFF
r6 := load(r6); r7 := r7 << 2 + S2
r7 := load(r7); r8 := r5 >> 8 & 0xFF
r8 := r8 << 2 + S3; r9 := r9 << 2 + S4
r8 := load(r8); r9 := load(r9)
r10 := r6 xor r7
r10 := r10 - r8
r10 := r10 + r9; C := C xor r0
C := C xor r10; r1 := rkp[3]
r3 := r2 << 32; r4 := r2 & 0xFFFFFFFF
r5 := r3 | r4; r0 := rkp[4]
r5 := r5 >> r1
r6 := r5 >> 24 & 0xFF; r9 := r5 & 0xFF
r6 := r6 << 2 + S1; r7 := r5 >> 16 & 0xFF
r6 := load(r6); r7 := r7 << 2 + S2
r7 := load(r7); r8 := r5 >> 8 & 0xFF
r8 := r8 << 2 + S3; r9 := r9 << 2 + S4
r8 := load(r8); r9 := load(r9)
r10 := r6 - r7
r10 := r10 + r8; B := B xor r9
B := B xor r10
r2 := B - r0; r1 := rkp[5]
r3 := r2 << 32; r4 := r2 & 0xFFFFFFFF
r5 := r3 | r4; r0 := rkp[6]
r5 := r5 >> r1
r6 := r5 >> 24 & 0xFF; r9 := r5 & 0xFF
r6 := r6 << 2 + S1; r7 := r5 >> 16 & 0xFF
r6 := load(r6); r7 := r7 << 2 + S2
r7 := load(r7); r8 := r5 >> 8 & 0xFF
r8 := r8 << 2 + S3; r9 := r9 << 2 + S4
r8 := load(r8); r9 := load(r9)
r10 := r6 + r7
r10 := r10 xor r8
r10 := r10 - r9
A := A xor r10
r2 := A + r0; r1 := rkp[7]
r3 := r2 << 32; r4 := r2 & 0xFFFFFFFF
r5 := r3 | r4; r0 := rkp[8]
r5 := r5 >> r1
r6 := r5 >> 24 & 0xFF; r9 := r5 & 0xFF
r6 := r6 << 2 + S1; r7 := r5 >> 16 & 0xFF
r6 := load(r6); r7 := r7 << 2 + S2
r7 := load(r7); r8 := r5 >> 8 & 0xFF
r8 := r8 << 2 + S3; r9 := r9 << 2 + S4
r8 := load(r8); r9 := load(r9)
r10 := r6 xor r7
r10 := r10 - r8
r10 := r10 + r9
D := D xor r10
The second round is faster than the others, because the f function
begins and ends with xor operations, which commute with the xor
operations surrounding the f function.
A forward quad-round takes 54 cycles. So the total time for the
cipher is 6 * 54 + 6 * 46 = 600. To get the average time per round,
we divide this by 48.
------------------------------
End of Cryptography-Digest Digest
******************************