A GPU kernel call that computes 256 steps takes 40 ms.
This is in the configuration with 16 threads per physical core,
which could be scaled down to 4 threads per core.
You will not see a fourfold throughput increase though.

The processing between kernel calls takes another 15 ms, which can be
avoided if you have enough data to fill a double buffer.
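
To put the two timings side by side, here is a rough sketch of the
per-chain step rate with and without the 15 ms gap. It assumes that a
full double buffer hides the gap completely, which is an idealisation;
the names are made up for illustration.

```python
# Figures quoted above.
KERNEL_MS = 40         # duration of one kernel call
STEPS_PER_CALL = 256   # steps computed per kernel call
GAP_MS = 15            # processing between kernel calls


def steps_per_second(double_buffered):
    """Steps per second for one chain.

    Assumption: with a full double buffer the gap between kernel
    calls overlaps entirely with the next kernel run.
    """
    cycle_ms = KERNEL_MS if double_buffered else KERNEL_MS + GAP_MS
    return STEPS_PER_CALL * 1000 / cycle_ms


plain = steps_per_second(False)
overlapped = steps_per_second(True)
print(f"without double buffer: {plain:.0f} steps/s per chain")
print(f"with double buffer:    {overlapped:.0f} steps/s per chain")
print(f"gain: {overlapped / plain:.3f}x")
```

Under that assumption the gain is simply 55/40, so a bit under 40%.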

A 5870 has 320 cores, so you would be processing 320 * 16 * 32 = 163840
chains at once.

A burst requires 40 * 8 * 51 = 16320 chain slots.
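
The two figures above can be combined to see how many bursts fit on the
card at once. This is only a back-of-the-envelope sketch: it assumes
that the 4-threads-per-core configuration keeps the other two factors
unchanged, and the constant names are mine.

```python
CORES = 320                 # 5870 cores, as quoted above
CHAINS_PER_THREAD = 32      # third factor quoted above
BURST_SLOTS = 40 * 8 * 51   # chain slots needed per burst


def chains_in_flight(threads_per_core):
    """Chains processed at once for a given threads-per-core setting."""
    return CORES * threads_per_core * CHAINS_PER_THREAD


for threads in (16, 4):
    chains = chains_in_flight(threads)
    # Whole bursts that fit into the available chain slots.
    bursts = chains // BURST_SLOTS
    print(f"{threads:2d} threads/core: {chains} chains, {bursts} bursts")
```

With 16 threads per core that is room for 10 whole bursts, with 4
threads per core for 2.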

Also study:
https://opensource.srlabs.de/attachments/61/stats.1.svg

e??t and e??b are the total number of lookups on flash and the current
        queue size of requests waiting for processing, respectively
s??t and s??b are the same for hard disk lookups

1) toendt and toendb are totals and queue size for the initial computation
        to the end value
2) fromstart are the stats for the initial computation from the start value
        to the target round
3) lastround are the stats for the step-by-step comparison during the last round

1) and 2) run on the GPU with bitsliced code
3) runs on the CPU with non-bitsliced code
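
If you want to label the series programmatically, the legend above can
be written down as a small lookup. This is a sketch only: the post
writes the disk id as "??", so a concrete name like "e01t" is a
hypothetical example, and the optional underscore is accepted because
both spellings (toendt, lastround_t) occur in this mail.

```python
import re

KINDS = {"e": "flash lookups", "s": "hard disk lookups"}
STAGES = {
    "toend": "initial computation to the end value (GPU, bitsliced)",
    "fromstart": "computation from the start value to the target round"
                 " (GPU, bitsliced)",
    "lastround": "step-by-step comparison in the last round"
                 " (CPU, not bitsliced)",
}
SUFFIXES = {"t": "total", "b": "queue size"}

# e??t / e??b / s??t / s??b: kind, two-character disk id, suffix.
DISK = re.compile(r"^([es])(.{2})([tb])$")
# toendt, fromstartb, lastround_t, ...: stage, optional "_", suffix.
STAGE = re.compile(r"^(toend|fromstart|lastround)_?([tb])$")


def describe(name):
    """Return a human-readable description of a series name."""
    m = STAGE.match(name)
    if m:
        return f"{SUFFIXES[m.group(2)]} of {STAGES[m.group(1)]}"
    m = DISK.match(name)
    if m:
        kind, disk, suffix = m.groups()
        return f"{SUFFIXES[suffix]} of {KINDS[kind]}, disk {disk}"
    raise ValueError(f"unknown series name: {name!r}")


for name in ("e01t", "s01b", "toendt", "lastround_t"):
    print(name, "->", describe(name))
```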

Each e?? represents one flash disk: 16 GB USB flash disks plus one 64 GB
flash disk (the thin teal line going way up). One of the 16 GB disks was
not full (thin orange line).
All HDs were actually images stored on a single RAID of 12 physical
disks, IIRC.

The left axis shows the number of lookups (HD lookups are 1/4 of total
lookups, because not every endpoint can be found on the flash).
The right axis is supposed to show the number of chains computed. The
thick orange bar would reach 24 * 8 * 408 = 78336, because 24 tables
were used. lastround_t reaches 25% of that, because of the 25% hit rate.
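
For reading values off the right axis, the expected levels work out as
follows. This is plain arithmetic on the factors given above; the
variable names are mine.

```python
TABLES = 24          # tables used in this run
PER_TABLE = 8 * 408  # remaining factors of the product quoted above
HIT_DIVISOR = 4      # a 25% hit rate is one chain in four

# Level the thick orange bar would reach.
chains_expected = TABLES * PER_TABLE
# Level lastround_t reaches, at 25% of the above.
lastround_expected = chains_expected // HIT_DIVISOR

print("chains computed:", chains_expected)
print("lastround_t:    ", lastround_expected)
```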

Without the HD bottleneck and the CPU bottleneck you would have a 10 second
lookup time for 8 bursts.

Oh, and it seems e??b was left out, because there is no backlog to speak
of (you can see the flash coming to a halt after the initial (toend)
computations end).


On Thu, Feb 17, 2011 at 09:41:00PM +0100, Sylvain Munaut wrote:
> Hi,
> 
> What are the current performance of the lookup code running on the GPU
> ? (even in not yet released version).
> In rounds per second ? (one round being going from 64 bits states ->
> 100 + 64 bits generated)
> 
> Cheers,
> 
>     Sylvain
> _______________________________________________
> A51 mailing list
> A51@lists.reflextor.com
> http://lists.lists.reflextor.com/cgi-bin/mailman/listinfo/a51
