[amibroker] Cuda prototype using 64 cores on graphics card

dloyer123 Thu, 24 Jul 2008 12:13:28 -0700

I was able to get an AmiBroker DLL to work with the Nvidia CUDA drivers.

These drivers allow C code to run on the graphics shaders of a modern 
video card.  These are the same processors that allow high-speed 3D 
graphics.  Several math-intensive applications report a 50-100 fold 
performance improvement over running on the host CPU.


The mid-range card that came with my system has 64 cores, each able to 
perform one floating point operation per clock.

As a simple test, I wrote an AmiBroker plug-in, called from AFL.  

It calculated the average price (H+L+C)/3 for 60464 bars in 21us.

This works out to about 8.5 GFLOPS (billion floating point operations 
per second) and 46GB/s memory transfer speed (read 3 floats and write 
one per bar; 2 floating point adds and 1 multiply per bar).
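As a rough check of these figures, assuming 4-byte single-precision floats and taking the 21us timing as exact:

```latex
\begin{align*}
\text{operations} &= 60464 \times 3 = 181392,
  & \frac{181392}{21 \times 10^{-6}\,\text{s}} &\approx 8.6 \times 10^{9}\ \text{FLOP/s} \\
\text{bytes moved} &= 60464 \times (3 + 1) \times 4 = 967424,
  & \frac{967424}{21 \times 10^{-6}\,\text{s}} &\approx 46 \times 10^{9}\ \text{B/s}
\end{align*}
```

Both results are in line with the rounded numbers quoted above.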

The 46GB/s transfer rate is not far from the available memory 
bandwidth on the card, but the simple test calculation is not 
very "dense", so I should be able to get a much higher calculation 
rate once I move more of my code to the graphics cores.  Several of 
the CUDA demos report > 150 GFLOPS.  Memory is the bottleneck of this 
simple test.  I used one thread per bar.
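A minimal sketch of what a one-thread-per-bar kernel for this test might look like. This is not the code from the plug-in: the names avgPriceKernel and averagePriceGpu are made up for illustration, the block size of 256 is an arbitrary choice, and nothing here touches the AmiBroker API.

```cuda
#include <vector>
#include <cuda_runtime.h>

// One thread per bar: read high, low and close, write one average.
__global__ void avgPriceKernel(const float* high, const float* low,
                               const float* close, float* avg, int numBars)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numBars) {
        // Two adds and one multiply per bar (multiply by 1/3 instead of dividing).
        avg[i] = (high[i] + low[i] + close[i]) * (1.0f / 3.0f);
    }
}

// Hypothetical host wrapper: copies the price arrays to the card, runs the
// kernel, and copies the result back. Returns an empty vector on any failure.
std::vector<float> averagePriceGpu(const std::vector<float>& high,
                                   const std::vector<float>& low,
                                   const std::vector<float>& close)
{
    const int n = static_cast<int>(high.size());
    if (n == 0 || low.size() != high.size() || close.size() != high.size())
        return {};

    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *dHigh = nullptr, *dLow = nullptr, *dClose = nullptr, *dAvg = nullptr;
    std::vector<float> result(n);

    bool ok = cudaMalloc(&dHigh, bytes) == cudaSuccess &&
              cudaMalloc(&dLow, bytes) == cudaSuccess &&
              cudaMalloc(&dClose, bytes) == cudaSuccess &&
              cudaMalloc(&dAvg, bytes) == cudaSuccess &&
              cudaMemcpy(dHigh, high.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
              cudaMemcpy(dLow, low.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
              cudaMemcpy(dClose, close.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        const int threads = 256;                          // arbitrary block size
        const int blocks = (n + threads - 1) / threads;   // enough blocks to cover every bar
        avgPriceKernel<<<blocks, threads>>>(dHigh, dLow, dClose, dAvg, n);
        // The device-to-host copy waits for the kernel to finish.
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(result.data(), dAvg, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    // cudaFree accepts null pointers, so this is safe after a partial failure.
    cudaFree(dHigh); cudaFree(dLow); cudaFree(dClose); cudaFree(dAvg);
    if (!ok) result.clear();
    return result;
}
```

Each thread touches 16 bytes and does only three operations, which is why a kernel of this shape is limited by memory bandwidth rather than arithmetic.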

High end graphics cards are available now that would improve 
performance by another factor of 2 to 4.  

A few problems:
* The above numbers do not include the time needed to copy the data 
from AmiBroker to the graphics card or copy the results back.  This 
time is much greater than the calculation time in this simple test.
* This is not a general AFL accelerator.  
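The copy overhead mentioned in the first point can be measured separately with CUDA events. The following is only a sketch: timeCopiesMs is a hypothetical helper, the host buffer is filled with dummy values rather than real price data, and ordinary pageable host memory is assumed.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Times three host-to-device copies (high, low, close) plus one
// device-to-host copy (the result) for numBars floats each.
// Returns elapsed milliseconds, or a negative value if a CUDA call fails.
float timeCopiesMs(int numBars)
{
    if (numBars <= 0) return -1.0f;

    const size_t bytes = static_cast<size_t>(numBars) * sizeof(float);
    std::vector<float> host(numBars, 1.0f);   // dummy stand-in for price data
    float* dev[4] = {nullptr, nullptr, nullptr, nullptr};
    cudaEvent_t start = nullptr, stop = nullptr;
    float ms = -1.0f;

    bool ok = cudaEventCreate(&start) == cudaSuccess &&
              cudaEventCreate(&stop) == cudaSuccess;
    for (int k = 0; ok && k < 4; ++k)
        ok = cudaMalloc(&dev[k], bytes) == cudaSuccess;

    if (ok) {
        cudaEventRecord(start);
        for (int k = 0; ok && k < 3; ++k)     // inputs to the card
            ok = cudaMemcpy(dev[k], host.data(), bytes,
                            cudaMemcpyHostToDevice) == cudaSuccess;
        // Result buffer back to the host (its contents do not matter here).
        ok = ok && cudaMemcpy(host.data(), dev[3], bytes,
                              cudaMemcpyDeviceToHost) == cudaSuccess;
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        ok = ok && cudaEventElapsedTime(&ms, start, stop) == cudaSuccess;
        if (!ok) ms = -1.0f;
    }

    for (int k = 0; k < 4; ++k) cudaFree(dev[k]);
    if (start) cudaEventDestroy(start);
    if (stop) cudaEventDestroy(stop);
    return ms;
}
```

Calling timeCopiesMs(60464) and comparing the result with the 21us kernel time gives a feel for how much of each call would be spent on the bus rather than on the calculation.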

My goal is to reduce my current 25s backtest time down to < 1s per 
pass.  To do this, I will need to move the data set for all symbols 
to the graphics card once and make many passes over the data with 
different optimization values.  Each CUDA thread will work on one 
symbol, rather than a thread per bar as in my first test.   
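A sketch of that plan, under loud assumptions: the trading rule below is a toy invented for illustration (it is not my system), the names backtestKernel and sweepLookbacks are hypothetical, and the prices are assumed to be stored bar-major (all symbols for bar 0, then all symbols for bar 1, and so on) so that neighbouring threads read neighbouring addresses.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Toy rule, for illustration only: if close[t] is above the close `lookback`
// bars earlier, hold the symbol over the next bar and collect the price change.
// One thread walks every bar of one symbol.
__global__ void backtestKernel(const float* close, int numBars, int numSymbols,
                               int lookback, float* profit)
{
    int sym = blockIdx.x * blockDim.x + threadIdx.x;
    if (sym >= numSymbols) return;

    const float* col = close + sym;           // this symbol's column
    float total = 0.0f;
    for (int t = lookback; t + 1 < numBars; ++t) {
        float now  = col[(size_t)t * numSymbols];
        float then = col[(size_t)(t - lookback) * numSymbols];
        float next = col[(size_t)(t + 1) * numSymbols];
        if (now > then) total += next - now;
    }
    profit[sym] = total;
}

// Uploads the prices once, then runs one kernel pass per lookback value.
// Returns the summed profit over all symbols for each pass (empty on failure).
std::vector<float> sweepLookbacks(const std::vector<float>& closeBarMajor,
                                  int numBars, int numSymbols,
                                  const std::vector<int>& lookbacks)
{
    std::vector<float> fitness;
    if (numBars <= 0 || numSymbols <= 0 ||
        closeBarMajor.size() != (size_t)numBars * numSymbols)
        return fitness;

    float *dClose = nullptr, *dProfit = nullptr;
    const size_t priceBytes = closeBarMajor.size() * sizeof(float);
    const size_t profitBytes = (size_t)numSymbols * sizeof(float);

    // The price data crosses the bus once, before any pass.
    bool ok = cudaMalloc(&dClose, priceBytes) == cudaSuccess &&
              cudaMalloc(&dProfit, profitBytes) == cudaSuccess &&
              cudaMemcpy(dClose, closeBarMajor.data(), priceBytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;

    std::vector<float> profit(numSymbols);
    const int threads = 64;                               // arbitrary block size
    const int blocks = (numSymbols + threads - 1) / threads;

    for (size_t p = 0; ok && p < lookbacks.size(); ++p) {
        ok = lookbacks[p] >= 1;                           // kernel assumes lookback >= 1
        if (!ok) break;
        backtestKernel<<<blocks, threads>>>(dClose, numBars, numSymbols,
                                            lookbacks[p], dProfit);
        // Only the small per-symbol result comes back on each pass.
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(profit.data(), dProfit, profitBytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
        if (ok) {
            float sum = 0.0f;
            for (float v : profit) sum += v;
            fitness.push_back(sum);
        }
    }

    cudaFree(dClose);
    cudaFree(dProfit);
    if (!ok) fitness.clear();
    return fitness;
}
```

The point of the layout is that each pass moves only one float per symbol back to the host, while the full price history stays on the card between passes.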

There is not much point in writing a CUDA routine to just execute 
directly from AFL code.  There is too much overhead.  In my 
application, the AFL code is a very small part of the total time for 
each backtest.  Even if I reduced the time to zero, it would not 
reduce the time per pass very much.  Also, the time needed to copy 
the price data on each pass would greatly reduce the benefit.  As far 
as I can tell, the current Ami API does not allow injecting an 
externally generated trade list into the backtest, so I will need to 
perform the full backtest and fitness function calculation 
externally.  

I had no compatibility problems getting the CUDA API to run as an Ami 
plug-in.  

Why go to the trouble?  Using Fred's IO program would get much of the 
same benefit for less trouble, or I could wait until Ami finally 
supports multiple cores or finds other clever ways to reduce the 
per-pass overhead.  The real answer is that I just had to try it....
