256 threads per block and as many blocks as there are multiprocessors, i.e. 30 for a GTX 280 and 27 for a GTX 260 with 216 SPs, for example. This was optimal when I did timing runs a while ago. If you do the math, this works out to 8 warps per MP and 32 threads per physical core. These values are probably close to optimal: they max out the number of resident threads, and with one block per MP there are no context switches between blocks, which would add (a small) overhead because the lookup tables have to be copied into shared memory again.
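As a quick sketch, the arithmetic above can be checked like this (the function name is mine; the constant of 8 scalar processors per multiprocessor is the GT200-era value, and one block per MP is assumed as described):

```python
# Back-of-the-envelope check of the launch configuration above,
# using GT200-era (compute capability 1.3) hardware constants.
WARP_SIZE = 32    # threads per warp
SPS_PER_MP = 8    # scalar processors per multiprocessor on GT200

def launch_stats(threads_per_block, num_mps):
    """Derive per-MP figures when running one block per multiprocessor."""
    blocks = num_mps  # one block per MP, as in the post
    warps_per_mp = threads_per_block // WARP_SIZE
    threads_per_core = threads_per_block // SPS_PER_MP
    return blocks, warps_per_mp, threads_per_core

# GTX 280 has 30 MPs; the GTX 260 (216 SP variant) has 27.
print(launch_stats(256, 30))  # → (30, 8, 32)
print(launch_stats(256, 27))  # → (27, 8, 32)
```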
Blocks and threads can be configured like --device cuda:blocks=23:threads=42

> What's the occupancy rate of current CUDA kernels on recent GPUs with compute
> capability 1.3?
>
> _______________________________________________
> A51 mailing list
> [email protected]
> http://lists.lists.reflextor.com/cgi-bin/mailman/listinfo/a51
