This is very interesting. AB DLLs are one thing, but do you think it's
possible to run individual instances of AB itself with CUDA?

----- Original Message -----
From: dloyer123 
Date: Thursday, July 24, 2008 3:13 pm
Subject: [amibroker] Cuda prototype using 64 cores on graphics card
To: [email protected]

> I was able to get an AmiBroker DLL to work with the Nvidia CUDA drivers.
> 
> These drivers allow C code to run on the graphics shaders of a modern
> video card. These are the same processors that allow high-speed 3D
> graphics. Several math-intensive applications report a 50- to 100-fold
> performance improvement over running on the host CPU.
> 
> The mid-range card that came with my system has 64 cores, each able to
> perform one floating point operation per clock.
> 
> As a simple test, I wrote an AmiBroker plug-in, called from AFL.
> 
> It calculated the average price (H+L+C)/3 for 60464 bars in 21 us.
> 
> This works out to about 8.5 GFLOPS (billion floating point operations
> per second) and 46 GB/s of memory traffic (3 floats read and one
> written per bar; 2 floating point adds and 1 multiply per bar).
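
As a rough check of the figures quoted above, here is a minimal Python sketch of the arithmetic. It assumes 4-byte single-precision floats; the variable names are illustrative only and are not taken from the plug-in.

```python
# Figures quoted in the message.
bars = 60464
elapsed_s = 21e-6            # 21 microseconds

# Per bar: 2 adds + 1 multiply, 3 floats read + 1 float written.
flops_per_bar = 3
bytes_per_bar = 4 * 4        # assumption: 4-byte single-precision floats

gflops = bars * flops_per_bar / elapsed_s / 1e9
gbytes_per_s = bars * bytes_per_bar / elapsed_s / 1e9

print(f"{gflops:.1f} GFLOPS, {gbytes_per_s:.1f} GB/s")
```

With these inputs the result lands close to the quoted 8.5 GFLOPS and 46 GB/s; the small difference is presumably rounding in the 21 us timing.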
> 
> The 46 GB/s transfer rate is not far from the available memory
> bandwidth on the card, but the simple test calculation is not very
> "dense", so I should be able to get a much higher calculation rate
> once I move more of my code to the graphics cores. Several of the
> CUDA demos report more than 150 GFLOPS. Memory is the bottleneck of
> this simple test. I used one thread per bar.
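
To illustrate what "one thread per bar" means, here is a sequential Python stand-in for that kind of launch. This is a sketch, not the plug-in's code: the block size of 256, the function names, and the constant test data are all assumptions made for the example.

```python
import math

def typical_price_thread(idx, high, low, close, out):
    """Work done by one simulated GPU thread: exactly one bar."""
    if idx < len(out):  # guard: the last block may extend past the data
        # 2 adds and 1 multiply, as described in the message
        out[idx] = (high[idx] + low[idx] + close[idx]) * (1.0 / 3.0)

def launch(high, low, close, threads_per_block=256):
    """Sequential stand-in for a parallel launch, one thread per bar."""
    n_bars = len(close)
    n_blocks = math.ceil(n_bars / threads_per_block)
    out = [0.0] * n_bars
    for block in range(n_blocks):
        for thread in range(threads_per_block):
            idx = block * threads_per_block + thread  # global thread index
            typical_price_thread(idx, high, low, close, out)
    return out, n_blocks

n = 60464                     # bar count from the message
high = [12.0] * n             # made-up constant prices
low = [9.0] * n
close = [9.0] * n
avg, blocks = launch(high, low, close)
print(blocks, avg[0], avg[-1])
```

With the assumed block size, 60464 bars need 237 blocks, and the bounds check keeps the spare threads in the last block from writing past the end of the arrays.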
> 
> High-end graphics cards available now would improve
> performance by another factor of 2 to 4. 
> 
> A few problems:
> * The above numbers do not include the time needed to copy the data
>   from AmiBroker to the graphics card or copy the results back. This
>   time is much greater than the calculation time in this simple test.
> * This is not a general AFL accelerator.
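
To get a feel for why the copies dominate, here is a back-of-the-envelope Python sketch. The bus bandwidth is a placeholder chosen for illustration, not a measurement from the message, and the function name is made up; fixed per-call latency is ignored.

```python
def copy_time_s(n_bars, n_arrays, bus_bytes_per_s, bytes_per_float=4):
    """Time to move n_arrays float arrays of n_bars each over the bus."""
    return n_bars * n_arrays * bytes_per_float / bus_bytes_per_s

# ASSUMPTION: effective host<->card bandwidth, placeholder value only.
ASSUMED_BUS_BYTES_PER_S = 2e9

kernel_s = 21e-6  # calculation time quoted in the message
# 3 input arrays (H, L, C) copied to the card, 1 result array copied back.
transfer_s = copy_time_s(60464, 4, ASSUMED_BUS_BYTES_PER_S)
ratio = transfer_s / kernel_s

print(f"copy {transfer_s * 1e6:.0f} us vs compute {kernel_s * 1e6:.0f} us "
      f"(about {ratio:.0f}x)")
```

Under that assumed bandwidth the copies take well over ten times longer than the calculation, which is in line with the first bullet; the real ratio depends on the actual bus and driver overhead.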
> 
> My goal is to reduce my current 25 s backtest time to under 1 s per
> pass. To do this, I will need to move the data set for all symbols
> to the graphics card once and make many passes over the data with
> different optimization values. Each CUDA thread will work on one
> symbol, rather than one thread per bar as in my first test.
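
The "upload once, many passes, one thread per symbol" plan can be sketched in plain Python as below. Everything here is hypothetical: the symbols, the prices, and the toy momentum rule used as a fitness function are invented for illustration and say nothing about the actual trading system.

```python
def fitness_for_symbol(closes, lookback):
    """Toy per-symbol work (hypothetical rule): hold for one bar whenever
    the close is above the close `lookback` bars earlier; sum the gains."""
    total = 0.0
    for i in range(lookback, len(closes) - 1):
        if closes[i] > closes[i - lookback]:
            total += closes[i + 1] - closes[i]
    return total

def run_pass(resident_data, lookback):
    """One optimization pass. On the card each symbol would get its own
    thread; this loop is a sequential stand-in."""
    return sum(fitness_for_symbol(c, lookback) for c in resident_data.values())

# "Uploaded" once and reused by every pass (made-up data).
resident_data = {
    "AAA": [10.0, 11.0, 12.0, 11.5, 12.5, 13.0],
    "BBB": [20.0, 19.0, 18.5, 19.5, 19.0, 18.0],
}

# Many passes over the same resident data with different parameter values.
results = {lb: run_pass(resident_data, lb) for lb in (1, 2, 3)}
best_lookback = max(results, key=results.get)
print(results, best_lookback)
```

The point of the structure is that the price data is transferred a single time, while only the small parameter value changes between passes.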
> 
> There is not much point in writing a CUDA routine that just executes
> directly from AFL code. There is too much overhead. In my
> application, the AFL code is a very small part of the total time for
> each backtest. Even if I reduced that time to zero, it would not
> reduce the time per pass very much. Also, the time needed to copy
> the price data on each pass would greatly reduce the benefit. As far
> as I can tell, the current Ami API does not allow injecting an
> externally generated trade list into the backtest, so I will need to
> perform the full backtest and fitness function calculation
> externally.
> 
> I had no compatibility problems getting the CUDA API to run as an
> Ami plug-in.
> 
> Why go to the trouble? Using Fred's IO program would get much of the
> same benefit for less trouble, or I could wait until Ami finally
> supports multiple cores or finds other clever ways to reduce the
> per-pass overhead. The real answer is that I just had to try it....