Cryptography-Digest Digest #793, Volume #8 Wed, 23 Dec 98 18:13:02 EST
Contents:
AES candidate performance on the Alpha 21164 processor (version 1) (Kenneth Almquist)
----------------------------------------------------------------------------
From: [EMAIL PROTECTED] (Kenneth Almquist)
Subject: AES candidate performance on the Alpha 21164 processor (version 1)
Date: 23 Dec 1998 21:40:58 GMT
The Alpha 21164 processor is an interesting platform for comparison
of AES candidates. It is one of the most advanced microprocessors
available today, and thus provides some indication of what midrange
processors will look like a few years from now. In particular, it
is a 64 bit processor. Since Intel is moving to a 64 bit architecture
(Merced), it seems safe to assume that within a few years, 64 bit
processors will be the norm for personal computers. It would be
even more interesting to know how the candidates will perform on
the Merced, but Intel hasn't released the specifications of the
Merced yet.
1. The Numbers
I've computed some timing numbers for the Alpha 21164 processor, using
the documentation available from the digital.com web site. To get
these numbers, I coded the body of a single round, multiplied by the
number of rounds, and added the extra cost of whitening (if used).
algorithm     total   rounds   cycles/round   extra
---------------------------------------------------
DFC            304       8        38             0
Rijndael-128   340      10        34             0
Twofish        368      16        22.5           8
HPC            376?      8        47             0
Crypton        408      12        34             0
Rijndael-192   408      12        34             0
RC6            467      20        23             7
Rijndael-256   476      14        34             0
CAST-256       600      48        12.5           0
Serpent        947?     31        30            17
(total = rounds * cycles/round + extra)
These numbers are optimistic because they don't include overhead
such as loading the data to be encrypted into registers. They also
don't include cache misses. (The data cache on the 21164 contains
8K bytes, and all the algorithms require less.) The three entries
for Rijndael are for Rijndael with different key sizes.
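Each total in the table is simply the number of rounds times the cycles per round, plus the extra whitening cost. A quick check of the arithmetic (illustrative only):

```python
# (total, rounds, cycles_per_round, extra) as listed in the table
table = {
    "DFC":          (304,  8, 38,   0),
    "Rijndael-128": (340, 10, 34,   0),
    "Twofish":      (368, 16, 22.5, 8),
    "RC6":          (467, 20, 23,   7),
    "CAST-256":     (600, 48, 12.5, 0),
    "Serpent":      (947, 31, 30,  17),
}
for name, (total, rounds, per_round, extra) in table.items():
    assert total == rounds * per_round + extra, name
```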
The designers of two of the algorithms have provided numbers for
implementations on the Alpha:
algorithm cycles
---------------------
DFC: 428
HPC: 421
In both cases, these numbers are significantly higher than mine. I
would expect them to be larger because my numbers are optimistic, but
the differences are large enough that I assume the reported numbers
are either for the Alpha 21064 processor (which is slower than the
21164 processor) or are due to less efficient coding.
I apologize if I have omitted your favorite algorithm. I gave
preference to algorithms which are freely usable, seem to have a
chance of being selected as the AES standard, and/or can use the
features of a 64 bit architecture.
2. Comments on the Algorithms
The Alpha 21164 can execute two integer instructions in parallel.
Crypton, Rijndael, Twofish, and (probably) Serpent all have sufficient
parallelism to keep the processor occupied. DFC, HPC, and RC6 do not.
Here are my comments on the individual algorithms:
Decorrelated Fast Cipher (DFC), which is relatively slow on 32 bit
machines, lives up to its name when run on the Alpha. The round
function is conceptually simple, but includes a 64 bit multiply and
a modulo operation with the modulus 2**64+13. These operations
are slow, even on the Alpha. DFC comes out on top because the
number of rounds is small. (Note that DFC is provably secure
against many attacks, so it is not necessarily less secure than
competitors with more rounds.)
Twofish has a relatively complex round function, but it has enough
parallelism that two instructions can be executed every cycle,
leading to the smallest number of cycles per round.
Rijndael and Crypton have the same structure, differing only in
the tables used. Each round modifies the entire block, so the
number of table lookups per round is twice that of Twofish.
Otherwise, the round function is quite simple. The round
function has enough parallelism that two instructions can be
executed every cycle. With 128 bit keys, Rijndael uses 10 rounds,
and comes out slightly ahead of Twofish. Crypton uses 12 rounds,
and Rijndael uses 12 and 14 rounds with 192 and 256 bit keys,
respectively, so these are slower than Twofish.
The 128 bit block version of Hasty Pudding Cipher (HPC) has only eight
rounds, but each round can be thought of as 8 Feistel rounds, so the
cipher can also be viewed as consisting of 64 quick Feistel rounds.
The round function is such a mind-numbing mixture of operations
(shifts, adds, subtracts, xors, and table lookups) that I've put a
question mark next to my cycle count. The main problem with the round
function is that it has little parallelism. My guess is that the
cipher was developed for the Alpha 21064 chip, which I believe could
execute only one instruction per cycle. With HPC, the second pipeline
on the 21164 processor is mostly idle. The 512 bit block variant of
HPC appears to have more parallelism, and thus may be a better match
for the 21164 processor.
(As an aside, I got a chuckle out of the HPC AES submission. Commenting
on the difficulty of implementing HPC on smart cards, which generally
contain 128 bytes (at most 256 bytes) of RAM, the submission deadpans:
"The minimum memory requirement of 2KB may be a challenge".)
RC6 uses 32 bit multiplication, and has the fastest encryption speed
of all the AES candidates when run on the Pentium Pro, which has a
fast hardware multiplier. The 21164 also has a fast 32 bit
multiplier. However, RC6 also performs four rotates per round (two of
them variable), and these are not particularly efficient on 64 bit
machines. The round function would be two instructions (one cycle)
shorter if the variable rotations rotated right rather than left. The
same difference in instruction count is likely to occur on a "typical"
64 bit machine. There seems to be no cryptographic reason for RC6's
variable rotations to go left rather than right, so the choice of
rotation direction appears to indicate that performance on 64 bit
machines was not a high priority for the RC6 designers.
CAST-256 has the fastest round function, but each round only modifies
1/4 of the block, so it takes four rounds, or 50 cycles, to modify
every bit in a block. This is slower than any of the other ciphers
except DFC. CAST-256 has a reasonable amount of parallelism. It runs
about 8.7% slower than it would if it could be scheduled to execute
two instructions every cycle.
Serpent is by far the slowest of the algorithms considered here. The
value given in the table is an estimate because I have not worked out
the implementations of the S-boxes. My guess is that the estimate is
on the high side, with the cost per round being closer to 29.5 than to
30. The most time consuming part of the cipher is the linear mixing,
which includes four 32 bit rotates.
3. The Implementations
In this section, I provide the implementations used to calculate the
performance numbers given in section 1. This section is rather long,
so many readers will want to stop here, but it seems only fair to
include the basis for my comparison so that others can check my work
if they desire.
Rather than giving Alpha instructions, I present the implementations
in pseudocode, where each pseudocode statement corresponds to one
Alpha instruction. If two instructions can be executed in one cycle, I
put them on the same line, so each line can be executed in a single
cycle. Idle processor cycles are indicated by lines containing the
word "wait."
When a cipher uses round keys, I assume that a pointer named rkp
points to the round key array. I give the subscripts for the first
iteration of the loop. If it doesn't affect performance, I adjust
rkp to point to the next set of round keys. Otherwise, subsequent
iterations use different subscript values. (That is not a problem
because I assume that the loops are fully unrolled, meaning that
the constants can be different in each iteration).
Most Alpha instructions can only take 0 through 255 as the value of a
constant operand. The exception is in address calculation, which adds
a 16 bit offset to the value of a register. Other constants used in
the code have to be preloaded into registers. In the pseudocode, I
simply use constants without indicating whether they are encoded in
the instruction or taken from registers, since the execution time is
the same either way (except for the cost of loading the registers
prior to encryption, which is small).
3.1. RC6
The 32 bit rotates are computed by concatenating the 32 bit value
with itself to form a 64 bit value, and then shifting right. A rotate
right of X by N bits is
(X << 32 | X & 0xFFFFFFFF) >> N
The rotate operator used in the RC6 specification rotates left, and
masks the shift count, so it is implemented as
(X << 32 | X & 0xFFFFFFFF) >> (32 - (N & 0x1F))
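The concatenate-and-shift identity can be checked against a direct rotate-left in Python (a quick sanity check, not Alpha code):

```python
def rc6_rotl(x, n):
    """32 bit rotate left via concatenation and a single right shift."""
    n &= 0x1F
    return ((x << 32 | x) >> (32 - n)) & 0xFFFFFFFF

def rotl_reference(x, n):
    """Direct definition of 32 bit rotate left."""
    n &= 0x1F
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF if n else x

for x in (0x00000001, 0x80000001, 0xDEADBEEF):
    for n in range(64):          # shift counts wrap at 32, as in RC6
        assert rc6_rotl(x, n) == rotl_reference(x, n)
```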
The 32 bit multiply operation takes 8 cycles, and the second 32 bit
multiply must start 4 cycles after the first. Also, the operands
to the multiply instruction must be calculated two cycles in advance.
Because one multiply instruction must start later than the other, I
calculate
B * (2 * B + 1)
as
B * B * 2 + B
(Credit goes to Doug Whiting for this optimization.) Otherwise, the
code is straightforward.
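The rescheduled form computes the same 32 bit result as the direct form, since the identity B * (2B + 1) = B*B*2 + B also holds modulo 2**32. A quick check (illustrative only):

```python
MASK32 = 0xFFFFFFFF

def f_direct(b):
    # B * (2B + 1), truncated to 32 bits as the mull instruction would
    return (b * (2 * b + 1)) & MASK32

def f_rescheduled(b):
    # B*B first, so the second multiply can be issued earlier
    return (b * b * 2 + b) & MASK32

for b in (0, 1, 0xDEADBEEF, MASK32):
    assert f_direct(b) == f_rescheduled(b)
```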
The instruction schedule below executes at the rate of one cycle
per line. It consists of 26 lines, but the last three lines can
overlap the first three, so in an unrolled loop each round takes
23 cycles.
r2 := D + D
r2 := r2 + 1
r4 := B * B
r0 := rkp[0]; r1 := rkp[1]
rkp := rkp + 8
wait
r5 := D * r2
wait
wait
wait
r4 := r4 + r4
r4 := r4 + B
r2 := r4 << 32; r3 := r4 & 0xFFFFFFFF
r4 := r2 | r3
r2 := r5 << 32; r3 := r5 & 0xFFFFFFFF
r5 := r2 | r3; t := r3 >> 27
u := r4 >> 27; A := A xor t
r8 := u & 0x1F; r7 := t & 0x1F
r10 := A << 32; r9 := A & 0xFFFFFFFF
A := r9 | r10; r8 := 32 - r8
A := A >> r8; C := C xor r4
r7 := 32 - r7; A := A + r0 -- r0 contains S[2*i]
r11 := C << 32; r12 := C & 0xFFFFFFFF
C := r11 | r12
C := C >> r7
C := C + r1 -- r1 contains S[2*i+1]
The initial whitening operation is implemented as
r0 := rkp[0]; r1 := rkp[1]
rkp := rkp + 8
B := B xor r0; D := D xor r1
The final whitening operation is
C := r11 | r12; r2 := rkp[2]
C := C >> r7; r3 := rkp[3]
C := C + r1; A := A xor r2
C := C xor r3
This includes the last operations of the final round, which would
otherwise not be accounted for.
3.2. DFC
We must perform two multiply operations because an Alpha instruction
can only produce a 64 bit output, and we want the 128 bit product.
The Alpha instruction to produce the high order bits takes 14 clock
cycles, while the one to produce the low order bits takes only 12
cycles, so we perform them in that order. The second multiply
operation must start 8 cycles after the first.
The first round key is needed right at the start of the iteration,
so we prefetch it into r0 during the preceding round. (This means
that before the first round, we preload r0 with the first round key.)
The modulo 2**64+13 operation first uses the reduction suggested in
the DFC submission:
(Q * 2**64 + R) modulo 2**64+13
= (13 * ((2**64 - 1) - Q) + 182 + R) modulo 2**64+13
This gives us a value less than 15 * 2**64, which we then reduce using
(Q' * 2**64 + R') modulo 2**64+13 = (R' - 13 * Q') modulo 2**64+13
Since the Alpha doesn't have a carry bit, we split numbers into 32
bit chunks before performing most operations. The exceptions are the
multiply operation and the computation of ((2**64 - 1) - Q). The
latter computation cannot overflow, and is performed by taking the
bitwise complement of Q (which I write as ~Q).
Multiplication of a number X by 13 is done using the formula
X * 13 = ((X * 4) + ((X * 8) + X))
which can be evaluated in two instructions on the Alpha.
When performing the second reduction, where I subtract 13*Q' from
R', the result could be negative, either because I'm working with
only the low 32 bits of R', and I need to do a borrow operation,
or because R' is actually less than 13*Q'. My understanding is
that the low 32 bits of R' should behave like a uniformly distributed
random number. This being the case, and since Q' <= 14, the
probability of the subtraction yielding a negative value is so
small that I ignore the cost of handling this condition in my
cycle count, although I do of course include the cost of testing
for it. The code to handle this case appears at the end of the
section.
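The two reduction steps can be checked directly in Python. This is a sketch of the arithmetic only, ignoring the 32 bit chunking used in the actual schedule; the negative case corresponds to the rarely-taken path at label L3 below.

```python
import random

M = 2**64 + 13

def dfc_reduce(q, r):
    """Reduce q * 2**64 + r modulo 2**64 + 13 using the two-step method."""
    # First reduction: result is less than 15 * 2**64
    t = 13 * ((2**64 - 1) - q) + 182 + r
    # Second reduction: split into Q' and R', then compute R' - 13 * Q'
    q2, r2 = t >> 64, t & (2**64 - 1)
    t = r2 - 13 * q2
    if t < 0:           # rare case, handled out of line in the schedule
        t += M
    return t

random.seed(1)
for _ in range(1000):
    q, r = random.getrandbits(64), random.getrandbits(64)
    assert dfc_reduce(q, r) == (q * 2**64 + r) % M
```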
The round table (RT) is specified to be a table of 32 bit values,
but it is xored with the high order bits, so we implement it as
a table of 64 bit values, with the left shift 32 operations
precalculated.
r1 := rkp[1]
wait
r2 := high_bits(A * r0)
wait
wait
wait
wait
wait
wait
wait
r3 := A * r0; rkp := rkp + 16
r0 := rkp[0]
r14 := r1 >> 32; r15 := r1 & 0xFFFFFFFF
wait
wait
wait
r2 := ~r2
r4 := r2 >> 32; r5 := r2 & 0xFFFFFFFF
r6 := (r4 << 3) + r4; r7 := (r5 << 3) + r5
r6 := (r4 << 2) + r6; r7 := (r5 << 2) + r7
wait
wait
r8 := r3 >> 32; r9 := r3 & 0xFFFFFFFF
r8 := r8 + r14; r9 := r9 + r15
r6 := r6 + r8; r7 := r7 + r9
r7 := r7 + 182
r10 := r7 >> 32; r7 := r7 & 0xFFFFFFFF
r6 := r6 + r10
r10 := r6 >> 32; r6 := r6 & 0xFFFFFFFF
r11 := (r10 << 3) + r10; r12 := r6 & 0x3F
r11 := (r10 << 2) + r11; r12 := (r12 << 3) + RT
r7 := r7 - r11
if r7 < 0 goto L3; r13 := r7 << 32
L1: r12 := load(r12); r13 := r13 | r6
r13 := r13 xor KC
r13 := r13 xor r12
r13 := r13 + KD
B := B xor r13
-- iteration ends here --
L2: r12 := r6 & 0x3F; r13 := r7 << 32
r12 := (r12 << 3) + RT; goto L1
L3: r7 := r7 + (1 << 32); r6 := r6 - 1
if r6 >= 0 goto L2
r7 := r7 + 13
if r7 <= 0xFFFFFFFF goto L2
r7 := r7 & 0xFFFFFFFF
r6 := r6 + 1
goto L2
3.3. HPC
HPC is straightforward but tedious to schedule for the Alpha. The value
of i is the round number. I have coded this assuming that the loop
is fully unrolled, meaning that i is a constant. Making i a variable
would increase the number of instructions, but not the execution time
since there are lots of cycles where currently only one instruction is
executing.
r0 := s0 & 0xFF
r0 := (r0 << 3) + KX
k := load(r0)
r2 := spice[i xor 4]
s1 := s1 + k; r3 := k << 8
s0 := s0 xor r3; r4 := s1 >> 11
s0 := s0 - r4; r5 := s1 << 2
s0 := s0 xor r5
s0 := s0 - r2
r6 := s0 << 32
r6 := r6 xor (PI19 + 128)
s0 := s0 + r6
r7 := s0 >> 17
s0 := s0 + r7
r8 := s0 >> 34; t := spice[i]
s0 := s0 + r8
s0 := s0 xor t; r10 := t << 5
s0 := s0 + r10; t := t >> 4
s1 := s1 + t; s0 := s0 xor t
r12 := s0 & 0x1F
r12 := r12 + 22
r13 := s0 << r12
s0 := s0 + r13
r14 := s0 >> 23; r15 := spice[i xor 7]
s0 := s0 xor r14
s0 := s0 - r15
t := s0 & 0xFF
t := (t << 3) + KX
k := load(t); t := t + 8 * (3*i + 1)
kk := load(t)
s1 := s1 xor k
r18 := kk >> 8; kk := kk xor k
s0 := s0 xor r18; r19 := kk >> 5
s1 := s1 + r19; r20 := kk << 12
s0 := s0 - r20; r21 := kk & ~0xFF
s0 := s0 xor r21
s1 := s1 + s0
r22 := s1 << 3; r23 := spice[i xor 2]
s0 := s0 + r22
s0 := s0 xor r23; r24 := KX[i + 128 + 16]
s0 := s0 + r24
r25 := s0 << 22; r27 := spice[i xor 1]
s0 := s0 + r25; r26 := s1 >> 4
s0 := s0 xor r26
s0 := s0 + r27
r29 := s0 >> (i + 33)
s0 := s0 xor r29
3.4. Twofish
Because Twofish has plenty of potential parallelism, what I give here
is simply a collection of instructions. Scheduling these to execute
at a rate of two per clock cycle is straightforward. I list them this
way so I can group related instructions together.
The data to be encrypted is stored in four registers called x0 through
x3. The only tricky part of this code is the rotate operations. The
rotate of x1 by 8 bits is performed as part of the operation where we
split x1 into 8 bit quantities to feed into the s-boxes. The rotates
by one bit are implemented as
rotate_left(X, 1) = (X << 1) - (X >> 31)
The shift of 31 arranges for us to subtract -1 (i.e. add positive 1)
if bit 31 of X is 1. For this to work on a 64 bit machine, the high
order bits of X must contain the sign extension of bit 31 of X. The
Alpha has 32 bit add and subtract operations, which calculate the
result, truncate to 32 bits, and then sign extend the truncated result
to 64 bits. All addition and subtraction operations in the Twofish
code should be interpreted as these 32 bit operations. Without this
special feature of the Alpha architecture, we would expect each of the
rotate operations to take an additional instruction, adding two to the
instruction count and one to the number of cycles per round.
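The rotate-by-one trick can be modeled in Python by sign-extending the 32 bit value into an unbounded integer (standing in for a 64 bit Alpha register) and truncating the result as the 32 bit addl/subl instructions would. This is my own model of the behavior, not Alpha code:

```python
def rotl1(x):
    """Rotate a 32 bit value left by 1 using (X << 1) - (X >> 31)."""
    # Sign-extend x, modeling a 32 bit value held in a 64 bit register
    sx = x - (1 << 32) if x & 0x80000000 else x
    t = (sx << 1) - (sx >> 31)   # >> is arithmetic on Python ints
    return t & 0xFFFFFFFF        # 32 bit truncation, as addl/subl do

assert rotl1(0x80000000) == 1
assert rotl1(0x40000000) == 0x80000000
assert rotl1(0xFFFFFFFF) == 0xFFFFFFFF
```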
I've named the tables sb0 through sb3, and stored them in a structure
pointed to by expkey.
r0 := x0 & 0xFF
r0 := (r0 << 2) + expkey
r0 := load(r0 + offset_of(sb0))
r1 := x0 >> 8 & 0xFF
r1 := (r1 << 2) + expkey
r1 := load(r1 + offset_of(sb1))
r2 := x0 >> 16 & 0xFF
r2 := (r2 << 2) + expkey
r2 := load(r2 + offset_of(sb2))
r3 := x0 >> 24 & 0xFF
r3 := (r3 << 2) + expkey
r3 := load(r3 + offset_of(sb3))
r4 := r0 xor r1
r5 := r2 xor r3
t0 := r4 xor r5
r6 := x1 >> 8 & 0xFF
r6 := (r6 << 2) + expkey
r6 := load(r6 + offset_of(sb0))
r7 := x1 >> 16 & 0xFF
r7 := (r7 << 2) + expkey
r7 := load(r7 + offset_of(sb1))
r8 := x1 >> 24 & 0xFF
r8 := (r8 << 2) + expkey
r8 := load(r8 + offset_of(sb2))
r9 := x1 & 0xFF
r9 := (r9 << 2) + expkey
r9 := load(r9 + offset_of(sb3))
r10 := r6 xor r7
r11 := r8 xor r9
t1 := r10 xor r11
r12 := x3 << 1
r13 := x3 >> 31
x3 := r12 - r13
r14 := rkp[0]
r15 := t0 + t1
r15 := r15 + r14
x2 := x2 xor r15
r16 := rkp[1]
r17 := t1 + t1
r17 := r17 + t0
r17 := r17 + r16
x3 := x3 xor r17
r18 := x2 << 1
r19 := x2 >> 31
x2 := r18 - r19
3.5. Rijndael and Crypton
As with Twofish, each table lookup takes three instructions:
r0 := x0 & 0xFF
r0 := (r0 << 2) + expkey
r0 := load(r0 + offset_of(sb0))
I won't bother to write out all 48 instructions used to implement the
16 table lookups. After placing the table lookup results into r0
through r15, the remaining instructions are:
x0 := rkp[0]
x0 := x0 xor r0
x0 := x0 xor r1
x0 := x0 xor r2
x0 := x0 xor r3
x1 := rkp[1]
x1 := x1 xor r4
x1 := x1 xor r5
x1 := x1 xor r6
x1 := x1 xor r7
x2 := rkp[2]
x2 := x2 xor r8
x2 := x2 xor r9
x2 := x2 xor r10
x2 := x2 xor r11
x3 := rkp[3]
x3 := x3 xor r12
x3 := x3 xor r13
x3 := x3 xor r14
x3 := x3 xor r15
These instructions are easily scheduled to execute at the rate of two
per cycle on the 21164.
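As the instruction sequence above shows, each output word is the xor of a round key word with four table entries selected by state bytes. A generic sketch of that pattern (illustrative only; t0..t3 are hypothetical placeholders, and which state byte indexes which table is determined by the cipher's diffusion layer, not shown here):

```python
def round_word(rk, b0, b1, b2, b3, t0, t1, t2, t3):
    """One output word: a round key word xored with four 32 bit table entries."""
    return rk ^ t0[b0] ^ t1[b1] ^ t2[b2] ^ t3[b3]

# With identity tables, the result is just the xor of the index bytes
ident = list(range(256))
assert round_word(0, 1, 2, 3, 4, ident, ident, ident, ident) == 1 ^ 2 ^ 3 ^ 4
```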
3.6. Serpent
The execution time for Serpent on the 21164 is an estimate because I
haven't worked out the S-boxes. Other than that, the implementation
is straightforward. The linear mixing phase of Serpent uses 32 bit
rotate operations, which are implemented as
X <<< N = (X << 32 | X & 0xFFFFFFFF) >> (32 - N)
The data to be encrypted is stored as 32 bit quantities in four
registers named A, B, C, and D.
The xor of the data with the round key is implemented as follows:
r0 := rkp[0]; r1 := rkp[1]
r2 := rkp[2]; r3 := rkp[3]
A := A xor r0; B := B xor r1
C := C xor r2; D := D xor r3
The Serpent s-box operations are conceptually simple, but a fast
implementation has to convert each of the S-boxes into
boolean operations. This takes an average of about 17 instructions,
which have a fair amount of parallelism. This takes 8.5 cycles
assuming that two instructions can be executed every cycle. I will
round up (to 9 cycles), since that will give the right answer even
if one of the instructions cannot be paired.
The linear mixing operation involves a bunch of rotate operations,
which we implement using the technique described in the implementation
of RC6.
r0 := A << 32; r1 := A & 0xFFFFFFFF
r2 := r0 | r1; r3 := C << 32
A := r2 >> (32 - 13); r4 := C & 0xFFFFFFFF
r5 := r3 | r4; r6 := A << 3
C := r5 >> (32 - 3); D := D xor r6
D := D xor C; B := B xor A
r7 := D << 32; r8 := D & 0xFFFFFFFF
r9 := r7 | r8; B := B xor C
r10 := B << 32; r11 := B & 0xFFFFFFFF
r12 := r10 | r11; D := r9 >> (32 - 7)
B := r12 >> (32 - 1); A := A xor D
A := A xor B; r13 := B << 7
C := C xor D; r14 := A << 32
C := C xor r13; r15 := A & 0xFFFFFFFF
r17 := C << 32; r18 := C & 0xFFFFFFFF
r16 := r14 | r15; r19 := r17 | r18
A := r16 >> (32 - 5)
C := r19 >> (32 - 22)
This is 34 instructions, which I will count as 17 cycles. (The
latency is 18 cycles, but we can execute two load instructions
from the "xor with key" operation during the last two cycles.)
The time per round is therefore
4 cycles for xor with key
9 cycles for sbox operations (estimated)
17 cycles for mixing operations
30 cycles total (estimated)
The last round replaces the mixing operations with another xor with
key, giving a time of 17 cycles for the last round. In the table I
list 31 regular rounds and an additional time of 17 clocks to account
for the last round.
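For reference, the linear mixing step that the schedule above implements can be written out directly. This sketch follows the rotation and shift amounts given in the Serpent specification:

```python
def rotl(x, n):
    """32 bit rotate left."""
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def serpent_lt(a, b, c, d):
    """Serpent linear transformation on the four 32 bit state words."""
    a = rotl(a, 13)
    c = rotl(c, 3)
    b = b ^ a ^ c
    d = d ^ c ^ ((a << 3) & 0xFFFFFFFF)
    b = rotl(b, 1)
    d = rotl(d, 7)
    a = a ^ b ^ d
    c = c ^ d ^ ((b << 7) & 0xFFFFFFFF)
    a = rotl(a, 5)
    c = rotl(c, 22)
    return a, b, c, d

# The transform is linear over GF(2), so the zero state maps to itself
assert serpent_lt(0, 0, 0, 0) == (0, 0, 0, 0)
```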
3.7. CAST-256
The specification uses a 32 bit rotate left operation. Since the
amount of rotation is controlled by the key, we assume that the round
key array has been preprocessed to convert the rotate left amounts into
rotate right amounts. We implement rotate right as
rotate_right(X, N) = (X << 32 | X & 0xFFFFFFFF) >> N
The round function needs one of its two round keys right at the start,
so we preload that key into r0 as part of the preceding round.
In the reverse quad-rounds, a single round can be executed with the
following instructions. Since three rounds can be executed in
parallel, these operations can be fully parallelized, resulting in
a time per quad-round of 46 cycles.
r1 := rkp[1]
r2 := A + r0
r0 := rkp[2]
r3 := r2 << 32
r4 := r2 & 0xFFFFFFFF
r5 := r3 | r4
r5 := r5 >> r1
r6 := r5 >> 24 & 0xFF
r6 := r6 << 2 + S1
r6 := load(r6)
r7 := r5 >> 16 & 0xFF
r7 := r7 << 2 + S2
r7 := load(r7)
r8 := r5 >> 8 & 0xFF
r8 := r8 << 2 + S3
r8 := load(r8)
r9 := r5 & 0xFF
r9 := r9 << 2 + S4
r9 := load(r9)
r10 := r6 xor r7
r10 := r10 - r8
r10 := r10 + r9
D := D xor r10
In the forward quad-rounds, the rounds must be performed serially. The
execution schedule for a forward quad-round is:
r2 := D + r0; r1 := rkp[1]
r3 := r2 << 32; r4 := r2 & 0xFFFFFFFF
r5 := r3 | r4; r0 := rkp[2]
r5 := r5 >> r1
r6 := r5 >> 24 & 0xFF; r9 := r5 & 0xFF
r6 := r6 << 2 + S1; r7 := r5 >> 16 & 0xFF
r6 := load(r6); r7 := r7 << 2 + S2
r7 := load(r7); r8 := r5 >> 8 & 0xFF
r8 := r8 << 2 + S3; r9 := r9 << 2 + S4
r8 := load(r8); r9 := load(r9)
r10 := r6 xor r7
r10 := r10 - r8
r10 := r10 + r9; C := C xor r0
C := C xor r10; r1 := rkp[3]
r3 := r2 << 32; r4 := r2 & 0xFFFFFFFF
r5 := r3 | r4; r0 := rkp[4]
r5 := r5 >> r1
r6 := r5 >> 24 & 0xFF; r9 := r5 & 0xFF
r6 := r6 << 2 + S1; r7 := r5 >> 16 & 0xFF
r6 := load(r6); r7 := r7 << 2 + S2
r7 := load(r7); r8 := r5 >> 8 & 0xFF
r8 := r8 << 2 + S3; r9 := r9 << 2 + S4
r8 := load(r8); r9 := load(r9)
r10 := r6 - r7
r10 := r10 + r8; B := B xor r9
B := B xor r10
r2 := B - r0; r1 := rkp[5]
r3 := r2 << 32; r4 := r2 & 0xFFFFFFFF
r5 := r3 | r4; r0 := rkp[6]
r5 := r5 >> r1
r6 := r5 >> 24 & 0xFF; r9 := r5 & 0xFF
r6 := r6 << 2 + S1; r7 := r5 >> 16 & 0xFF
r6 := load(r6); r7 := r7 << 2 + S2
r7 := load(r7); r8 := r5 >> 8 & 0xFF
r8 := r8 << 2 + S3; r9 := r9 << 2 + S4
r8 := load(r8); r9 := load(r9)
r10 := r6 + r7
r10 := r10 xor r8
r10 := r10 - r9
A := A xor r10
r2 := A + r0; r1 := rkp[7]
r3 := r2 << 32; r4 := r2 & 0xFFFFFFFF
r5 := r3 | r4; r0 := rkp[8]
r5 := r5 >> r1
r6 := r5 >> 24 & 0xFF; r9 := r5 & 0xFF
r6 := r6 << 2 + S1; r7 := r5 >> 16 & 0xFF
r6 := load(r6); r7 := r7 << 2 + S2
r7 := load(r7); r8 := r5 >> 8 & 0xFF
r8 := r8 << 2 + S3; r9 := r9 << 2 + S4
r8 := load(r8); r9 := load(r9)
r10 := r6 xor r7
r10 := r10 - r8
r10 := r10 + r9
D := D xor r10
The second round is faster than the others, because the f function
begins and ends with xor operations, which commute with the xor
operations surrounding the f function.
A forward quad-round takes 54 cycles. So the total time for the
cipher is 6 * 54 + 6 * 46 = 600. To get the average time per round,
we divide this by 48.
------------------------------
End of Cryptography-Digest Digest
******************************