For a 32-round table with 1024*1024*1024 start values:

 1 1073741823
 2 1023438982
 3  976087268
 4  932768763
 5  893795050
 6  857258423
 7  824594686
 8  793842630
 9  764984529
10  742264968
11  717542086
12  693794500
13  671348674
14  653519282
15  633911863
16  615705741
17  600811372
18  584022929
19  567966279
20  553038610
21  541126688
22  527714176
23  514813648
24  504121909
25  492442040
26  480922764
27  469931269
28  461146395
29  451036696
30  441442433
31  433670029
32  ?
33  ?
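As a quick cross-check of the table above, here is a small Python sketch that sums the listed chain counts (rounds 1 to 31, since the last values are unknown) and compares the total with running every start value through the same number of rounds. The function name `work_summary` and the choice of baseline (first count times number of listed rounds) are my own assumptions for illustration; that baseline is not the same rounding as the "30 Grounds" figure in this message.

```python
# Chain counts *before* each round, copied from the 32-round table above
# (rounds 1..31; the values for rounds 32 and 33 are unknown and left out).
chains_before_round = [
    1073741823, 1023438982, 976087268, 932768763, 893795050, 857258423,
    824594686, 793842630, 764984529, 742264968, 717542086, 693794500,
    671348674, 653519282, 633911863, 615705741, 600811372, 584022929,
    567966279, 553038610, 541126688, 527714176, 514813648, 504121909,
    492442040, 480922764, 469931269, 461146395, 451036696, 441442433,
    433670029,
]


def work_summary(counts):
    """Return (actual, baseline, ratio), with work measured in chain-rounds.

    actual   : sum of the chains that entered each listed round
    baseline : assumed cost if every start value ran through every listed round
    ratio    : baseline / actual
    """
    actual = sum(counts)
    baseline = counts[0] * len(counts)
    return actual, baseline, baseline / actual


actual, baseline, ratio = work_summary(chains_before_round)
print(f"rounds listed : {len(chains_before_round)}")
print(f"actual work   : {actual / 1e9:.2f} G chain-rounds")
print(f"baseline work : {baseline / 1e9:.2f} G chain-rounds")
print(f"ratio         : {ratio:.2f}x")
```

The actual-work figure should land close to the 20.5 G sum quoted in this message; the ratio depends entirely on which baseline you pick.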
sum: 20.5 G rounds (instead of 30 G rounds)

The first column is the round number, starting at 1; the second is the
number of chains *before* that round. After the last round the chains
were written unsorted (no chain count was printed there), and then the
final sorter purged some more (those numbers are lost), so you have to
interpolate the last values. I could also do a 270M table in between to
find its values.

I also have stats for an 8-round table, with 1G chains at the start:

1 1073741824
2 1022078072
3  975558741
4  932155936
5  893248017
6  856191164
7  822026508
8  791037697
9  ?

sum: 7.3 G rounds instead of 8

On Mon, Jan 11, 2010 at 10:48:26PM +0100, Frank A. Stevenson wrote:
> Some news, and a question to consider.
>
> I managed to speed up the ATI chain generation by somewhere close to
> 30%. Currently a single 5850 card can calculate ~650 chains per second,
> meaning my dual card setup can complete a table in less than 2.5 days.
>
> I did this by unrolling my loop a little, and rather than shifting the
> output, I use indexed writes to cached memory. Since ALU clauses are run
> in parallel with memory fetches in the GPU threading engine this is
> almost a pure gain.
>
> I still have some more tricks up my sleeves, but first a question for
> Karsten or whoever would like to do the maths:
>
> I am thinking about "merge free table generations", and the procedure
> goes like this:
>
> Start with 270M points, and calculate the first round only and write to
> disk. Then read that output, and bucket sort the DP1s, eliminating any
> merges. For non merges, calculate the second round and write to disk.
> Repeat this for each of the 32 rounds, keeping fewer and fewer chains,
> and you will have produced a table containing only merges from the
> 32nd round.
>
> Clearly this is faster, as disk access is much quicker than calculating
> the rounds, but the real question is how much work can you eliminate
> this way? What speedup will you get?
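To get a feel for the procedure quoted above without real hardware, here is a toy Python model, not the actual A5/1 chain code. It assumes each round behaves like an independent random function onto a space of `space_size` values, and it uses a Python set as a stand-in for the bucket sort plus duplicate removal. The names `simulate_merge_free` and `speedup`, and all parameter values, are hypothetical and chosen only so the script runs quickly; the real merge rate per round comes from the measured tables, not from this model.

```python
import random


def simulate_merge_free(n_start, n_rounds, space_size, seed=1):
    """Toy model: return the number of chains alive before each round.

    Each round is modelled as a fresh random function. Chains whose round
    outputs collide count as merged, and only one chain per output survives.
    """
    rng = random.Random(seed)
    alive = n_start
    before = []
    for _ in range(n_rounds):
        before.append(alive)
        # Surviving chains hold distinct values, so under the random-function
        # assumption their outputs are independent uniform draws.
        endpoints = {rng.randrange(space_size) for _ in range(alive)}
        # The set plays the role of "bucket sort, then drop merged chains".
        alive = len(endpoints)
    return before


def speedup(before):
    """Work without any elimination divided by work with per-round elimination."""
    return before[0] * len(before) / sum(before)


counts = simulate_merge_free(n_start=20000, n_rounds=32, space_size=200000)
print("chains before first rounds:", counts[:4], "... last:", counts[-1])
print(f"speedup vs. no elimination: {speedup(counts):.2f}x")
```

Changing `space_size` relative to `n_start` changes how quickly chains merge, so the printed speedup only illustrates the shape of the calculation.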
>
> f
>
> _______________________________________________
> A51 mailing list
> [email protected]
> http://lists.lists.reflextor.com/cgi-bin/mailman/listinfo/a51
