On 1/6/10 9:27 PM, sascha wrote:
> On Wed, Jan 06, 2010 at 04:04:37PM +0100, M vd S wrote:
>> If so, where can I find it?
>
> on my hard disk.

ah. great.
>> If not, would it be an idea to at least prop up a working (slow) C
>> version (suitable for whatever architecture), which could be made
>> SSE-aware in a next stage? SSE might be relatively slow, but
>> multi-CPU multicore Xeons are abundant nowadays, so it would allow a
>> larger share of the community to join in and still contribute
>> significantly. The chain-generating code must be available anyway to
>> people not owning graphics cards, in order to look up a chain.
>
> SSE is a viable target for the table lookup, but it is outperformed by
> the cuda client by more than an order of magnitude in both speed and
> operations/watt, so it should not be used for table generation.

x86 is slow for bit operations, true. But its L2 cache is huge (8 MB per CPU here), and I think the number of Xeons in the world is orders of magnitude bigger than the number of GPUs. Ops/watt are not our biggest concern.

Most bit operations can be replaced with smart lookup tables. The A5/1 LFSRs are small enough that useful tables fit in the L2 cache. Mapping 20 bits of input to 16 bits of information, for example, consumes only 2 MB.

Has this been considered in the "order of magnitude" difference between the platforms?

As a simple example, consider predicting the next 6 shifts. One would use the clock bits and their 5 neighbours each - that's 18 bits of index. The table then gives 12 bits indicating the next 6 shift patterns (of which there are only 4 combinations per step, hence 2 bits per shift). This would cost only 1<<19 bytes, or 0.5 MB.

A further table could predict 3 bits of keystream, based on 6 of these 12 bits and the 9 leftmost LFSR bits. That would cost 1<<15 bytes, or 32 KB (or predict 4 bits, which would cost 1<<20 bytes, or 1 MB).

For each LFSR, a table could map the XORable bits and their neighbours to the feedback bit pattern.
For LFSR 1, such a feedback table could take the leftmost 8 bits of data + 6 bits of shift pattern to get the next 3 bits, or 9 bits of data + 8 bits of pattern to get the next 4 bits (17 bits in total). For LFSR 2, you could predict all of the next 6 bits with a 7+12 = 19-bit index. For LFSR 3, you would take bits 22-18 and 7-5 (8 in total) and 6 bits of shift pattern to get the next 3 bits; add 4 bits for the next 4 bits (18 bits in total).

This setup would remain well within the limits of my L2 cache. The tables are generic in the sense that they have to be generated/loaded only once at startup. Of course some fine-tuning will speed things up even more, and of course the resulting code will not win the beauty pageant - but who cares. Some tables can be combined (at the cost of a shift operation), since only 4 bits per addressable byte are used.

The first table mentioned could also predict the next 8 shifts, at 32 MB in size, but it would then reside in main memory. This would make all other tables 4-bit multiples while staying well within L2 cache limits. Other things to consider are the optimal alignment of the LFSRs. Maybe making these tables even bigger results in faster-running code (using "slow" memory but processing more bits at once).

The biggest LFSR state space is only 23 bits, or 8M states. This means I can map any state to any of the next 8 states with one lookup in a 192 MB table. That leaves me 23.8 GB for other tables.

I don't know how the current SSE implementation works, so maybe all this was considered already. But if not, this could possibly speed things up by at least a factor of 8.

These are just some thoughts. I think a reference implementation to work on could spur the C optimizing contest.

Kind regards,

M.

_______________________________________________
A51 mailing list
[email protected]
http://lists.lists.reflextor.com/cgi-bin/mailman/listinfo/a51
