sascha wrote:
> Regarding CPU implementations:
>
> I have attached a 64-wide SIMD implementation (single instruction, 64
> bits of data) that does 8 million A5/1 rounds per second on a single
> core of a 2 GHz Core 2 Duo. I am not sure whether -O3 uses SSE
> automagically, so I assume not.
> I use 64 64-bit-wide memory locations (so effectively L1 cache) to
> store the A5/1 registers vertically, so I can generate a keystream bit
> for 64 independent A5/1 chains with a single set of instructions:
>
> (r1<18>(regs) ^ r2<21>(regs) ^ r3<22>(regs));
>
> so r1<18>(regs) points to a 64-bit memory location that stores bit 18
> of R1 for 64 A5/1 engines.
>
> The performance bottleneck is the shifting of R1, R2, R3, since you have
> to touch each register.

I like this idea (it would be great to make a practical implementation).
Without having given it too much thought, it seems that the shifting can be avoided by unrolling the loop and using the 64 registers as a ring buffer. Hurray: more templates :-)
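To make the vertical layout concrete, here is a minimal C++ sketch of one majority-clocked A5/1 step over 64 lanes. It is not sascha's attached code: the names `SlicedA51`, `clock_planes` and `step` are hypothetical, and the sketch assumes the usual A5/1 tap and clock-bit positions, with the output bit taken after clocking.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical layout: plane i of a register holds bit i of that register
// for 64 independent A5/1 engines (one engine per bit position, "lane").
struct SlicedA51 {
    std::array<std::uint64_t, 19> r1{};
    std::array<std::uint64_t, 22> r2{};
    std::array<std::uint64_t, 23> r3{};
};

// Shift one register by one position, but only in the lanes selected by clk.
// Lanes not selected keep their old bits. Every plane is touched once.
template <std::size_t N>
void clock_planes(std::array<std::uint64_t, N>& r,
                  std::uint64_t feedback, std::uint64_t clk) {
    for (std::size_t i = N - 1; i > 0; --i)
        r[i] = (r[i] & ~clk) | (r[i - 1] & clk);
    r[0] = (r[0] & ~clk) | (feedback & clk);
}

// One A5/1 step for all 64 lanes; returns one keystream bit per lane.
std::uint64_t step(SlicedA51& s) {
    // Clocking bits: R1[8], R2[10], R3[10]; majority computed per lane.
    const std::uint64_t c1 = s.r1[8], c2 = s.r2[10], c3 = s.r3[10];
    const std::uint64_t maj = (c1 & c2) | (c1 & c3) | (c2 & c3);

    // Feedback from the current state, before any plane is shifted.
    const std::uint64_t f1 = s.r1[13] ^ s.r1[16] ^ s.r1[17] ^ s.r1[18];
    const std::uint64_t f2 = s.r2[20] ^ s.r2[21];
    const std::uint64_t f3 = s.r3[7] ^ s.r3[20] ^ s.r3[21] ^ s.r3[22];

    // A register is clocked in the lanes where its clock bit equals the majority.
    clock_planes(s.r1, f1, ~(c1 ^ maj));
    clock_planes(s.r2, f2, ~(c2 ^ maj));
    clock_planes(s.r3, f3, ~(c3 ^ maj));

    // Same expression as in the quoted mail: r1<18> ^ r2<21> ^ r3<22>.
    return s.r1[18] ^ s.r2[21] ^ s.r3[22];
}
```

In this sketch the clock mask differs per lane, so each plane is blended with its neighbour, not merely renamed; that blend is the per-register touch described as the bottleneck.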
At the end of 64 iterations, you could then logical-OR the 15 LSB registers together and inspect the result for 0 bits (distinguished points). The round functions can be held in another set of 64-bit registers/memory holding the transposed round functions, and the relevant round function(s) can be updated across the registers.

Frank
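The distinguished-point test described above could be sketched as follows. This assumes that the 64 iterations leave 64 output planes, where plane i holds bit i of each lane's 64-bit value and plane 0 is the least significant; the names `to_planes` and `distinguished_lanes` are hypothetical.

```cpp
#include <array>
#include <cstdint>

using Planes = std::array<std::uint64_t, 64>;

// Helper for illustration: transpose 64 per-lane values into bit planes,
// so that bit `lane` of planes[i] equals bit i of values[lane].
Planes to_planes(const std::array<std::uint64_t, 64>& values) {
    Planes planes{};
    for (int lane = 0; lane < 64; ++lane)
        for (int i = 0; i < 64; ++i)
            planes[i] |= ((values[lane] >> i) & 1ull) << lane;
    return planes;
}

// OR the low planes together. A 0 bit in the OR means that lane had all of
// its low bits clear, so the complement marks the distinguished lanes.
std::uint64_t distinguished_lanes(const Planes& planes, int low_bits = 15) {
    std::uint64_t any_set = 0;
    for (int i = 0; i < low_bits; ++i)
        any_set |= planes[i];
    return ~any_set;
}
```

The check costs 14 OR operations and one complement for all 64 chains, independent of how many lanes actually hit a distinguished point.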
