sascha wrote:
> Regarding CPU implementations:
>
> I have attached a 64 wide SIMD implementation (single instruction 64
> bits of data)
> that does 8 million a5/1 rounds per second on a single core
> of a 2ghz core2duo. I am not sure whether -O3 uses SSE automagically,
> so i assume not.
> I use 64 64bit wide memory locations (so effectively L1 cache) to
> store a51 registers vertically, and so i can generate a keystream bit
> for 64 independent a5/1 chains with a single set of instructions:
>
> (r1<18>(regs) ^ r2<21>(regs) ^ r3<22>(regs));
>
> so r1<18>(regs) points to a 64bit memory location that stores bit 18
> of R1 for 64 A5/1 engines.
>
> the performance bottleneck is the shifting of R1,R2,R3, since you have
> to touch each register.
>
>   
I like this idea (it would be great to make a practical implementation).
Without giving it too much thought, it seems that the shifting can be
avoided by unrolling the loop and using the 64 registers as a
ring buffer. Hurray: more templates :-)
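
To make the ring-buffer idea concrete, here is a rough sketch, not a
tested A5/1 implementation. It uses one bitsliced 19-bit LFSR with the
R1 taps (13, 16, 17, 18). The names RingLfsr and clock_shifting are made
up for illustration. It assumes all 64 lanes clock on every step; A5/1's
majority clocking is decided per lane, so a real version would still
need a per-lane select, which this sketch leaves out.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bitsliced 19-bit LFSR shaped like A5/1's R1 (taps 13, 16, 17, 18).
// Each 64-bit word holds one register bit for 64 independent chains.
constexpr std::size_t kLen = 19;

struct RingLfsr {                        // hypothetical name
    std::array<std::uint64_t, kLen> buf{};
    std::size_t head = 0;                // physical index of logical bit 0

    // Slice holding logical bit i of the register for all 64 lanes.
    std::uint64_t bit(std::size_t i) const {
        assert(i < kLen);
        return buf[(head + i) % kLen];
    }

    // Clock all 64 lanes once. No slice is moved; only `head` changes
    // and the slot of the old top bit is overwritten with the feedback.
    void clock() {
        const std::uint64_t fb = bit(13) ^ bit(16) ^ bit(17) ^ bit(18);
        head = (head + kLen - 1) % kLen;
        buf[head] = fb;
    }
};

// Reference version that physically shifts every slice, for comparison.
void clock_shifting(std::array<std::uint64_t, kLen>& r) {
    const std::uint64_t fb = r[13] ^ r[16] ^ r[17] ^ r[18];
    for (std::size_t i = kLen - 1; i > 0; --i) r[i] = r[i - 1];
    r[0] = fb;
}
```

With the loop fully unrolled, `head` would be a compile-time constant
(for example a template parameter), so the modulo disappears and each
step touches only the tap slices plus one store.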

At the end of 64 iterations, you could then <logical or> the 15
lsb-registers together and inspect the result for 0 bits
(distinguished points). The round functions can be held in another set
of 64-bit registers/memory holding <transposed round functions>, and the
relevant round function(s) can be updated across the registers.
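
The distinguished-point test could look roughly like this. It is a
sketch under assumptions: the Slices layout (slice i holds bit i of the
64-bit state for all 64 chains) and the name distinguished_mask are
hypothetical, and "the 15 lsb-registers" is taken to mean slices 0..14.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Assumed layout: slices[i] holds state bit i for 64 chains;
// bit k of every word belongs to chain k.
using Slices = std::array<std::uint64_t, 64>;

// Returns a mask with bit k set when chain k has its low `dp_bits`
// state bits all zero, i.e. chain k has reached a distinguished point.
std::uint64_t distinguished_mask(const Slices& slices, unsigned dp_bits = 15) {
    assert(dp_bits <= slices.size());
    std::uint64_t any_set = 0;
    for (unsigned i = 0; i < dp_bits; ++i) {
        any_set |= slices[i];   // a 1 here means some low bit is set
    }
    return ~any_set;            // 0 bits in the OR become 1 bits here
}
```

A nonzero return value means at least one of the 64 chains is done, so
the common case costs 15 ORs and one compare for all chains together.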

Frank


_______________________________________________
A51 mailing list
[email protected]
http://lists.lists.reflextor.com/cgi-bin/mailman/listinfo/a51
