Now that we're taking more advantage of PyCUDA's and CodePy's ability
to generate very precise special-case code, I'm finding that we
wind up with a lot of ambiguity about *which* generator should
handle a given special case.  The right choice for a particular input
structure is platform-dependent--a function of cache sizes, access
latencies, transfer bandwidth, register counts, number of processors,
etc.  The wrong choice can carry a big performance penalty.

FFTW and ATLAS get around this with self-tuning algorithms, which I
don't understand in detail, but which generally work by timing many
generators on many special cases, and then using the resulting database
of timings to make good choices quickly at runtime.
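
To make the idea concrete, here is a minimal sketch of what I mean,
in plain Python.  It is not how FFTW or ATLAS actually do it, and every
name in it (Autotuner, key_fn, the two toy candidates) is made up for
illustration.  It times each candidate once per "input class" and
remembers the winner.  For real GPU kernels the host wall clock would
not be enough; the timing would need device synchronization (e.g. CUDA
events), which is left out here.

```python
import time


class Autotuner:
    """Hypothetical helper: pick the fastest candidate per input class."""

    def __init__(self, candidates, key_fn, repeats=3, timer=time.perf_counter):
        self.candidates = dict(candidates)  # name -> callable
        self.key_fn = key_fn                # maps call args to a hashable input class
        self.repeats = repeats
        self.timer = timer
        self.timings = {}                   # key -> {name: best time seen}
        self.best = {}                      # key -> name of fastest candidate

    def _time(self, fn, args):
        # Best-of-N timing, to reduce noise from other activity on the machine.
        best = float("inf")
        for _ in range(self.repeats):
            start = self.timer()
            fn(*args)
            best = min(best, self.timer() - start)
        return best

    def tune(self, *args):
        # Time every candidate on this input and record the winner.
        key = self.key_fn(*args)
        results = {name: self._time(fn, args)
                   for name, fn in self.candidates.items()}
        self.timings[key] = results
        self.best[key] = min(results, key=results.get)
        return self.best[key]

    def __call__(self, *args):
        # Tune on first sight of an input class, then reuse the stored choice.
        key = self.key_fn(*args)
        if key not in self.best:
            self.tune(*args)
        return self.candidates[self.best[key]](*args)


# Two toy "generated" variants of the same operation.
def sum_builtin(xs):
    return sum(xs)


def sum_loop(xs):
    total = 0
    for x in xs:
        total += x
    return total


tuned_sum = Autotuner(
    {"builtin": sum_builtin, "loop": sum_loop},
    key_fn=lambda xs: len(xs),  # assumption: input size defines the special case
)
result = tuned_sum(list(range(1000)))
```

The timings dict is the "database" part: it could be saved to disk per
machine, so the tuning cost is paid once rather than in every process.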

It seems like this kind of automatic tuning is even more important for
GPU implementations than for CPU ones.  Are there libraries to help
with this?

James
-- 
http://www-etud.iro.umontreal.ca/~bergstrj

_______________________________________________
PyCUDA mailing list
[email protected]
http://tiker.net/mailman/listinfo/pycuda_tiker.net
