Now that we're taking more advantage of PyCUDA's and CodePy's ability to generate really precise special-case code, I'm finding that we wind up with a lot of ambiguity about *which* generator should handle a given special case. The right choice for a particular input structure is platform-dependent: it is a function of cache sizes, access latencies, transfer bandwidth, register counts, number of processors, etc. The wrong choice can carry a big performance penalty.
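
To make the problem concrete, here is a rough sketch of choosing empirically: time each candidate on a given input structure, remember the winner, and reuse it on later calls. Everything in it is hypothetical and for illustration only. The `Autotuner` class, the `key` convention, and the two CPU stand-in "generators" are my own names, not PyCUDA or CodePy APIs. It uses wall-clock timing on the CPU; real GPU kernels would need to be timed with CUDA events and proper synchronization instead.

```python
import time


class Autotuner:
    """Hypothetical helper: picks the fastest candidate per problem key."""

    def __init__(self, candidates, repeats=3):
        self.candidates = dict(candidates)  # name -> callable
        self.repeats = repeats
        self.best = {}  # problem key -> name of the fastest candidate seen

    def _time(self, func, args):
        # Take the minimum over a few runs to reduce timing noise.
        fastest = float("inf")
        for _ in range(self.repeats):
            start = time.perf_counter()
            func(*args)
            fastest = min(fastest, time.perf_counter() - start)
        return fastest

    def tune(self, key, args):
        # Time every candidate on this input and record the winner.
        timings = {name: self._time(func, args)
                   for name, func in self.candidates.items()}
        self.best[key] = min(timings, key=timings.get)
        return timings

    def __call__(self, key, *args):
        # Tune on first sight of a key, then reuse the recorded choice.
        if key not in self.best:
            self.tune(key, args)
        return self.candidates[self.best[key]](*args)


# Two stand-ins for generated special-case code (assumed, CPU-only).
def sum_loop(values):
    total = 0
    for value in values:
        total += value
    return total


def sum_builtin(values):
    return sum(values)


tuner = Autotuner({"loop": sum_loop, "builtin": sum_builtin})
data = list(range(100000))
result = tuner(("sum", len(data)), data)  # tunes once for this key
print(result, tuner.best)
```

The key here is just an operation name plus the input length. In practice it would probably have to encode whatever structure the generators specialize on (shapes, strides, dtypes) along with the device, and the `best` table would need to be persisted between runs.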
FFTW and ATLAS get around this with self-tuning algorithms, which I don't understand in detail, but which generally work by trying a lot of generators on a lot of special cases, and then using the resulting database of timings to make good choices quickly at runtime. It seems like this kind of automatic tuning is even more important for GPU implementations than for CPU ones. Are there libraries to help with this?

James
--
http://www-etud.iro.umontreal.ca/~bergstrj
_______________________________________________
PyCUDA mailing list
http://tiker.net/mailman/listinfo/pycuda_tiker.net
