I have access to a cluster if you want me to run some tests.

On Wed, Nov 4, 2009 at 12:44 AM,  <[email protected]> wrote:
> Send A51 mailing list submissions to
>        [email protected]
>
> To subscribe or unsubscribe via the World Wide Web, visit
>        http://lists.lists.reflextor.com/cgi-bin/mailman/listinfo/a51
> or, via email, send a message with subject or body 'help' to
>        [email protected]
>
> You can reach the person managing the list at
>        [email protected]
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of A51 digest..."
>
>
> Today's Topics:
>
>   1. Re: error while cd obj ; make installing the program (sascha)
>   2. Re: GPU choice (sascha)
>   3. RT trade-offs  (Re:  Hosting the tables) (Karsten Nohl)
>   4. Re: RT trade-offs  (Re:  Hosting the tables) (sascha)
>   5. Re: RT trade-offs  (Re:  Hosting the tables) (Karsten Nohl)
>   6. Re: RT trade-offs  (Re:  Hosting the tables) (sascha)
>   7. Re: CPU implementation (more code) (sascha)
>   8. Re: CPU implementation (more code) (Sascha Krissler)
>   9. Re: Fast sorting & compact tables (Frank A. Stevenson)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Tue, 3 Nov 2009 15:14:25 +0100
> From: sascha <[email protected]>
> Subject: Re: [A51] error while cd obj ; make installing the program
> To: [email protected]
> Message-ID: <20091103141425.ga4...@test>
> Content-Type: text/plain; charset=us-ascii
>
> generator=increment is not enabled by default, so
> use
> generator=lfsr2::tablesize=32
>
> Please provide a gdb backtrace for the segmentation fault
>
> gdb --args ./c --options --and --more --options generator --foo --bar
>
> gdb> run
> [ Segfaults ]
> gdb> ba
>
> On Tue, Nov 03, 2009 at 01:49:22AM -0800, Sylv1 wrote:
>> Hello,
>> thanks for the help, the make is now working correctly. I had just made a
>> stupid error in the path settings of the makefile...
>>
>> But I tried to run the test with:
>> time ./c --condition rounds:rounds=256 --implementation sharedmem
>> --algorithm A51 --roundfunc
>> xor:condition=distinguished_point::bits=8:generator=increment --device
>> cuda:threads=256:blocks=4:sleep=5000:operations=32768 --logger verbose
>> --work random:prefix=10,0 --consume print:results=16 generate
>> --chainlength 65536 --chains 1024
>>
>
>
> ------------------------------
>
> Message: 2
> Date: Tue, 3 Nov 2009 15:18:03 +0100
> From: sascha <[email protected]>
> Subject: Re: [A51] GPU choice
> To: [email protected]
> Message-ID: <20091103141803.gb4...@test>
> Content-Type: text/plain; charset=us-ascii
>
> the performance can be calculated as number of cores * core clock.
> i only have numbers for the 9600M-GT (20 chains/sec at 500mhz) and the
> GTX260 (140 chains/sec at 576 mhz). The 9600M-GT has 32 cores.
> So a 9600GT with 64 cores and a slightly higher clock should be able to
> do 42 chains/sec. (for example)
>
>
> On Tue, Nov 03, 2009 at 02:39:36AM -0800, Sylv1 wrote:
>> Hello!
>>
>> I also want to buy new hardware to begin my computations. What is the best
>> quality/price ratio in the following set, and which card performs best for the
>> table computation:
>> Nvidia GeForce 8400GS, GeForce 9600GT, GeForce 9400GT, GeForce 9500GT,
>> GeForce 9800GX2 and GeForce GTX260?
>>
>> Thanks
>> sylv1
>>
>> --- On Mon, 11/2/09, sascha <[email protected]> wrote:
>>
>> From: sascha <[email protected]>
>> Subject: Re: [A51] GPU choice
>> To: [email protected]
>> Date: Monday, November 2, 2009, 6:25 PM
>>
>> On Mon, Nov 02, 2009 at 05:46:13PM +0000, Jonathan wrote:
>> > Hello list!
>> > Been away for the last month with no internet access in the middle of 
>> > nowhere,
>> > sorry for absence.
>> >
>> > I am now in the process of picking some hardware...
>> > apart from motherboard, CPU, etc., how does this GPU look?
>> >
>> > ENGTX295/2DI/1792MD3 - 1792MB Asus GTX 295 55nm, 1998MHz GDDR3, GPU
>> > 576 MHz, Shader 1242 MHz, 480 Cores
>>
>> looks good. performance per dollar is quite bad with a GTX295 though.
>> 2 GTX260 with 432 cores together cost 260Eur and a GTX295 with 480
>> cores costs 370Eur. the 295 only occupies 2 slots instead of 4 with
>> 2 GTX260 though.
>> >
>> > Would using 2 increase (double?) performance?
>>
>> scales linearly.
>> http://en.wikipedia.org/wiki/Embarrassingly_parallel
>>
>> >
>> > What contiguous disk size should I look into?
>>
>> a single table occupies around 6gb, you can expect to fill 12gb per month
>> with a GTX295. For the sorting you may need another 6 or 15gb of temp
>> storage.
>> that depends on whether we dump stxxl, which is slow and hungry (in the way
>> it is used now), for our own sorter implementation.
>>
>> >
>> > Jon
>> >
>> _______________________________________________
>> A51 mailing list
>> [email protected]
>> http://lists.lists.reflextor.com/cgi-bin/mailman/listinfo/a51
>>
>>
>>
>>
>
>
>
>
> ------------------------------
>
> Message: 3
> Date: Tue, 3 Nov 2009 18:12:00 +0100
> From: Karsten Nohl <[email protected]>
> Subject: [A51] RT trade-offs  (Re:  Hosting the tables)
> To: [email protected]
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
>
>
> On Nov 3, 2009, at 4:09 PM, Jonathan wrote:
>
>> 2009/11/2 Karsten Nohl <[email protected]>:
>>>> I phrased the question wrong perhaps.
>>>> What is an IDEAL amount of storage for a complete set of tables
>>>> with a
>>>> decent probability?
>>>>
>>>> 99% or 96% being decent...
>>>
>>> The math works out as:
>>>
>>> 50% failure rate at <3TB
>>> Each time you halve the failure rate, the storage doubles.
>>> 75% = 6TB; ... ; 99% = ~150 TB
>
> One correction: this calculation ignored collisions and was therefore
> still too optimistic.
>
>> I remember the THC tables that were (allegedly) computed
>> were more on the order of 2.2 TB for a 6% I think...
>> Seems like a large difference... Maybe there is room for improvement?
>
> The trade-off goes three ways:   storage space  vs.  attack time  vs.
> failure rate
>
> The THC tables -- as I understand them -- had larger coverage in
> smaller space at the cost of much increased attack time. For those
> tables to be usable with reasonable attack times you need an
> FPGA cluster.
>
> Let 2^D be the available storage (in # of chains), 2^M be the hash
> size, and 1/2^C be the coverage, then the attack time 2^T (in # of
> computations) is derived as:
>
> Distinguished Point Tables
> T = 2M - 3C - 2D + 1
> Read as:
>
> Rainbow Tables
> T = 2M - 2C - 2D - 2
>
> We use a mix of these two techniques but stay closer to the
> distinguished points scenario.
> Parameters for the tables being computed at the moment are:
> M = 64  (A5/1)
> D = 38  (4 TB)
> C = 6  (50% success rate)
> => T = 33.5 (2 minutes on a computer with GTX260 that has all tables;
> less when the distributed cracking network is used)
>
>
>
>
>
> ------------------------------
>
> Message: 4
> Date: Tue, 3 Nov 2009 18:48:44 +0100
> From: sascha <[email protected]>
> Subject: Re: [A51] RT trade-offs  (Re:  Hosting the tables)
> To: [email protected]
> Message-ID: <20091103174844.ga6...@test>
> Content-Type: text/plain; charset=us-ascii
>
> On Tue, Nov 03, 2009 at 06:12:00PM +0100, Karsten Nohl wrote:
>> We use a mix of these two techniques but stay closer to the
>> distinguished points scenario.
>> Parameters for the tables being computed at the moment are:
>> M = 64  (A5/1)
>> D = 38  (4 TB)
>> C = 6  (50% success rate)
>> => T = 33.5 (2 minutes on a computer with GTX260 that has all tables;
>> less when the distributed cracking network is used)
>>
> This number is not correct. For a lookup of a single value, you have to 
> compute
> (1 + 2 + 3 + ... + 32) * 2^15 A5/1 rounds, which is 528 * 2^15 = 17.3 million
> To do the lookup in all 400 tables, you need 7 billion rounds.
> 7 billion links can be computed in 42 seconds on a GTX260, and you
> have to do this for 204 keystream samples, totaling 8500 seconds on
> a single node with all the tables and 21 seconds with 400 nodes sharing
> the computation.
>
>
> ------------------------------
>
> Message: 5
> Date: Tue, 3 Nov 2009 19:09:57 +0100
> From: Karsten Nohl <[email protected]>
> Subject: Re: [A51] RT trade-offs  (Re:  Hosting the tables)
> To: [email protected]
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
>
>
> On Nov 3, 2009, at 6:48 PM, sascha wrote:
>
>> On Tue, Nov 03, 2009 at 06:12:00PM +0100, Karsten Nohl wrote:
>>> We use a mix of these two techniques but stay closer to the
>>> distinguished points scenario.
>>> Parameters for the tables being computed at the moment are:
>>> M = 64  (A5/1)
>>> D = 38  (4 TB)
>>> C = 6  (50% success rate)
>>> => T = 33.5 (2 minutes on a computer with GTX260 that has all tables;
>>> less when the distributed cracking network is used)
>>>
>> This number is not correct. For a lookup of a single value, you have
>> to compute
>> (1 + 2 + 3 + ... + 32) * 2^15 A5/1 rounds, which is 528 * 2^15 =
>> 17.3 million
>> To do the lookup in all 400 tables, you need 7 billion rounds.
>> 7 billion links can be computed in 42 seconds on a GTX260, and you
>> have to do this for 204 keystream samples, totaling 8500 seconds on
>> a single node with all the tables and 21 seconds with 400 nodes
>> sharing
>> the computation.
>
> that's correct -- my bad. I was using the formula for pure
> distinguished points tables.
>
>
> ------------------------------
>
> Message: 6
> Date: Tue, 3 Nov 2009 19:16:53 +0100
> From: sascha <[email protected]>
> Subject: Re: [A51] RT trade-offs  (Re:  Hosting the tables)
> To: [email protected]
> Message-ID: <20091103181653.ga6...@test>
> Content-Type: text/plain; charset=us-ascii
>
> On Tue, Nov 03, 2009 at 07:09:57PM +0100, Karsten Nohl wrote:
>>
>> On Nov 3, 2009, at 6:48 PM, sascha wrote:
>>
>> > On Tue, Nov 03, 2009 at 06:12:00PM +0100, Karsten Nohl wrote:
>> >> We use a mix of these two techniques but stay closer to the
>> >> distinguished points scenario.
>> >> Parameters for the tables being computed at the moment are:
>> >> M = 64  (A5/1)
>> >> D = 38  (4 TB)
>> >> C = 6  (50% success rate)
>> >> => T = 33.5 (2 minutes on a computer with GTX260 that has all tables;
>> >> less when the distributed cracking network is used)
>> >>
>> > This number is not correct. For a lookup of a single value, you have
>> > to compute
>> > (1 + 2 + 3 + ... + 32) * 2^15 A5/1 rounds, which is 528 * 2^15 =
>> > 17.3 million
>> > To do the lookup in all 400 tables, you need 7 billion rounds.
>> > 7 billion links can be computed in 42 seconds on a GTX260, and you
>> > have to do this for 204 keystream samples, totaling 8500 seconds on
>> > a single node with all the tables and 21 seconds with 400 nodes
>> > sharing
>> > the computation.
>>
>> that's correct -- my bad. I was using the formula for pure
>> distinguished points tables.
>
> And I guess assuming all tables have no round function, which
> is impossible because of the merges.
> To show the rationale for rainbow tables again: with 400 * 32 DP tables
> (having the same number of merges as 400 rainbow tables),
> you have to compute 32 * 2^20 * 400 * 204 A5/1 rounds instead
> of 528 * 2^15 * 400 * 204, which is half as much.
>
>
> ------------------------------
>
> Message: 7
> Date: Tue, 3 Nov 2009 21:52:53 +0100
> From: sascha <[email protected]>
> Subject: Re: [A51] CPU implementation (more code)
> To: [email protected]
> Message-ID: <20091103205253.ga7...@test>
> Content-Type: text/plain; charset=us-ascii
>
>> Regarding CPU implementations:
> Some good news:
>
> The SSE version achieves 18 chains/second per core of my 2ghz core2duo.
> The Cell version runs at 10 chains/second per SPU (3.2ghz).
>
> some polishing has to be done before the code can be released.
> Anyone here paying the rent for roadrunner for 24 hours?
>
>
>
> ------------------------------
>
> Message: 8
> Date: Wed, 04 Nov 2009 00:06:28 +0100
> From: Sascha Krissler <[email protected]>
> Subject: Re: [A51] CPU implementation (more code)
> To: [email protected]
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset=iso-8859-15
>
> Forward to the list from PM
>
>
>> So you now use m128i registers and do instructions on 128 bit wide
>
> yes
>
>> values? I assume that what you are doing is what is usually called bit
>> slicing?
>
> yes
>
>> For both Cell and SSE2, do you use interlacing of instructions to fill
>> up the pipeline (and usually improve speed by a factor 2 or 3)? (not
>> sure if that really helps on bit sliced calculations though)
>
> i did not pay special attention to the instruction scheduling.
> The heavy load loop is unrolled and gcc will have plenty of opportunity
> to reorder the instructions.
> Also the loop body is executed ~650m times per second on a 2ghz core,
> effectively using 3 clocks per loop for the 7 instructions (unrolled).
>
> on the cell spu it is half as fast despite the higher clock, so i should try
> to optimize there.
> (but the SPU is also not fully superscalar, so it should be half as fast)
>
> with the SPU being in-order, instruction reordering and scheduling
> may pay off.
>
> This function does 90% of the work:
> nshift == 0, length = 19 or 22 or 23, RT = ssevector(uint64, uint64)
>
> template <int base, int length, int nshift, typename RT>
> void lsh_reg(RT * regs, RT clock3) {
>  RT clock2 = ~clock3;
>  int i;
>  for (i = base - length + 1; i < base; ++i) {
>    regs[i] = (regs[i + nshift] & clock2) | (regs[i + nshift + 1] & clock3);
>  }
> }
>
>
>
> Before loop unrolling:
>
> xmm0: clock2
> xmm6: ~clock2
> 720(...): regs[i + nshift]
> 736(...): regs[i + nshift + 1]
>
> L41:
>        .loc 2 110 0
>        movdqa  720(%ecx,%eax), %xmm1
>        movdqa  736(%ecx,%eax), %xmm2
> L18:
>        .loc 2 111 0
>        pand    %xmm0, %xmm1
>        pand    %xmm6, %xmm2
>        por     %xmm2, %xmm1
>        movdqa  %xmm1, 720(%ecx,%eax)
>        addl    $16, %eax
>        .loc 2 110 0
>        cmpl    $288, %eax
>        jne     L41
>
> After loop unrolling this pattern repeats:
> xmm2, xmm6: clock2, ~clock2, 720(), 736() regs[...]
>
>        movdqa  %xmm2, %xmm0
>        movdqa  %xmm6, %xmm1
>        pand    720(%ebx,%eax), %xmm0
>        pand    736(%ebx,%eax), %xmm1
>        por     %xmm1, %xmm0
>        movdqa  %xmm0, 720(%ebx,%eax)
>        leal    32(%edx), %eax
>
>
>
>
>
> ------------------------------
>
> Message: 9
> Date: Wed, 04 Nov 2009 06:43:55 +0100
> From: "Frank A. Stevenson" <[email protected]>
> Subject: Re: [A51] Fast sorting & compact tables
> To: A51 Mailing List <[email protected]>
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> Frank A. Stevenson wrote:
>> I wrote a table sorting function, ...
>>
>> Tarball is available for inspection at: http://traxme.net/a5/ (native
>> endianness is assumed when reading the tables)
>>
>> F
>>
> I have done more research on efficient table storage. My initial
> suggestion of using 1 million distinct files would simply overwhelm an
> ext2 formatted flash device with inodes, so I had to reduce the number
> of files to a more manageable 4096. I have updated my table sorting code
> accordingly (on the website for now, will add to svn later). I have also
> added code that does the lookup into the compressed tables (not finished).
>
> The interesting figures from an 8GB Kingston DataTraveler with 5MB/sec
> write speed are:
>
> 1000 lookups in a smaller table (40 million chains) can be performed in
> 2.1 seconds from a cold device (newly inserted, with no locally cached
> inodes) - access speed is not an issue :-)
>
> Time to sort a complete table would be limited by the write speed of the
> device (~15 minutes). It is also interesting to note that the suggested
> table size should fit in < 4GB in compressed format, so even multiples
> of tables can easily be stored on devices in commonly available sizes.
>
> The big advantage with flash based storage is that it won't bog down
> your machine with a torrent of disk requests when doing the lookup work.
>
> F
>
>
>
>
> ------------------------------
>
> _______________________________________________
> A51 mailing list
> [email protected]
> http://lists.lists.reflextor.com/cgi-bin/mailman/listinfo/a51
>
>
> End of A51 Digest, Vol 6, Issue 5
> *********************************
>



-- 
The Elata Foundation
www.elata.org