On 1/6/10 9:27 PM, sascha wrote:
> On Wed, Jan 06, 2010 at 04:04:37PM +0100, M vd S wrote:
>> If so, where can I find it?
>
> on my hard disk.

ah. great.
>> If not, would it be an idea to at least prop up a working (slow) C
>> version (suitable for whatever architecture), which could be made
>> SSE-aware in a next stage? SSE might be relatively slow, but
>> multi-CPU multicore Xeons are abundant nowadays, so it would allow a
>> larger share of the community to join in and still contribute
>> significantly. The chain-generating code must be available anyway to
>> people not owning graphics cards, in order to look up a chain.
>
> SSE is a viable target for the table lookup, but it is outperformed by
> the cuda client by more than an order of magnitude in both speed and
> operations/watt, so it should not be used for table generation.

x86 is slow for bit operations, true. But its L2 cache is huge (8 MB per CPU here), and I think the number of Xeons in the world is orders of magnitude bigger than the number of GPUs. Ops/watt are not our biggest concern.

Most bit operations can be replaced with smart lookup tables. The A5/1 LFSRs are small enough that useful tables fit in the L2 cache. Mapping 20 bits of input to 16 bits of information, for example, consumes only 2 MB.

Has this been considered in the "order of magnitude" difference between the platforms?

As a simple example, consider predicting the next 6 shifts. One would use the clock bits and their 5 neighbours each - that's 18 bits of index. The table then gives 12 bits indicating the next 6 shift patterns (of which there are only 4 combinations per step, hence 2 bits per shift). This would cost only 1<<19 bytes, or 0.5 MB.

A further table could predict 3 bits of keystream, based on 6 of these 12 bits and the 9 leftmost LFSR bits. That would cost 1<<15 bytes, or 32 KB (or predict 4 bits, which would cost 1<<20 bytes, or 1 MB).

For each LFSR, a table could map the XORable bits and their neighbours to the feedback bit pattern.
For LFSR 1, such a feedback table could take the leftmost 8 bits of data + 6 bits of shift pattern to get the next 3 bits, or 9 bits of data + 8 bits of pattern to get the next 4 bits (17 bits in total). For LFSR 2, you could predict all of the next 6 bits with a 7+12 = 19-bit index. For LFSR 3, you would take bits 22-18 and 7-5 (8 in total) and 6 bits of shift pattern to get the next 3 bits; add 4 bits for the next 4 bits (18 bits in total).

This setup would remain well within the limits of my L2 cache. The tables are generic in the sense that they have to be generated/loaded only once at startup. Of course some fine-tuning will speed things up even more, and of course the resulting code will not win the beauty pageant - but who cares. Some tables can be combined (at the cost of a shift operation), since only 4 bits per addressable byte are used.

The first table mentioned could also predict the next 8 shifts, at 32 MB in size, but it would then reside in main memory. This would make all other tables 4-bit multiples while staying well within L2 cache limits. Other things to consider are the optimal alignment of the LFSRs. Maybe making these tables even bigger results in faster-running code (using "slow" memory but processing more bits at once).

The biggest LFSR state space is only 23 bits, or 8M states. This means I can map any state to any of the next 8 states with one lookup in a 192 MB table. That leaves me 23.8 GB for other tables.

I don't know how the current SSE implementation works, so maybe all this was considered already. But if not, this could possibly speed things up by at least a factor of 8.

These are just some thoughts. I think a reference implementation to work on could spur the C optimizing contest.

Kind regards,

M.

_______________________________________________
A51 mailing list
[email protected]
http://lists.lists.reflextor.com/cgi-bin/mailman/listinfo/a51
