We don't really need finer-grained knowledge about the processor at compile time.
There are some other open-source projects which have already done something very similar, if not identical; one of them is the media player mplayer (http://www.mplayerhq.hu/). Why not use these as starting points?
The second question is how and when to figure out which of the available memcpy functions gives the best performance.
This depends a lot on whether the job has the nodes all to itself or shares them with other jobs; if they are shared, the data transfer between CPU and RAM during benchmarking can be significantly skewed.
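To make the question concrete, here is a minimal sketch of such a run-time selection in C. Everything in it is an assumption for illustration: the names copy_fn, copy_bytewise, copy_libc, time_copy and pick_fastest_copy are hypothetical, the byte loop only stands in for a real optimized variant, and clock() measures CPU time, so it does not remove the skew from other jobs competing for memory bandwidth.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Signature shared by all candidate copy routines (hypothetical name). */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Stand-in for an optimized variant: a plain byte-by-byte loop. */
static void *copy_bytewise(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}

/* The C library memcpy, wrapped so it can sit in the same table. */
static void *copy_libc(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* CPU time, in seconds, spent on reps copies of n bytes. */
static double time_copy(copy_fn fn, void *dst, const void *src,
                        size_t n, int reps)
{
    clock_t start = clock();
    for (int i = 0; i < reps; i++)
        fn(dst, src, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Returns the index of the fastest candidate that copies correctly,
 * or -1 if the buffers cannot be allocated or no candidate is correct. */
int pick_fastest_copy(const copy_fn *cands, int ncands, size_t n, int reps)
{
    assert(cands != NULL && ncands > 0 && n > 0 && reps > 0);

    unsigned char *src = malloc(n);
    unsigned char *dst = malloc(n);
    int best = -1;
    double best_time = 0.0;

    if (src != NULL && dst != NULL) {
        for (size_t i = 0; i < n; i++)
            src[i] = (unsigned char)i;

        for (int c = 0; c < ncands; c++) {
            memset(dst, 0, n);
            double t = time_copy(cands[c], dst, src, n, reps);
            if (memcmp(dst, src, n) != 0)
                continue;               /* ignore a broken candidate */
            if (best < 0 || t < best_time) {
                best = c;
                best_time = t;
            }
        }
    }
    free(src);
    free(dst);
    return best;
}
```

A caller would fill a table such as { copy_libc, copy_bytewise } and call pick_fastest_copy(table, 2, size, reps) once at startup. The result is only as trustworthy as the conditions under which it was measured, which is exactly the node-sharing concern.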
On a homogeneous architecture, this might be a one-node selection [I don't imagine using the modex to spread this information].
Hmm, it doesn't sound nice to have n-1 nodes waiting while 1 node does the test. Maybe run it on all nodes and compare the results? And warn the user if different memcpy versions would be chosen.
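The comparison step could look roughly like the C sketch below. It assumes each node has already picked a variant index and that these indices have been collected into one array (for instance with a gather); that collection step is left out. The function names first_disagreeing_node and warn_if_choices_differ are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Returns the index of the first node whose choice differs from
 * node 0, or -1 if all nodes picked the same variant. */
int first_disagreeing_node(const int *choice, int nnodes)
{
    assert(choice != NULL && nnodes > 0);
    for (int i = 1; i < nnodes; i++)
        if (choice[i] != choice[0])
            return i;
    return -1;
}

/* Prints a warning to stderr if the nodes disagree.
 * Returns 1 if a warning was printed, 0 otherwise. */
int warn_if_choices_differ(const int *choice, int nnodes)
{
    int bad = first_disagreeing_node(choice, nnodes);
    if (bad < 0)
        return 0;
    fprintf(stderr,
            "warning: node %d selected memcpy variant %d, "
            "node 0 selected variant %d\n",
            bad, choice[bad], choice[0]);
    return 1;
}
```

On a truly homogeneous and otherwise idle cluster one would expect no warning; a disagreement hints either at hardware differences or at a benchmark disturbed by other jobs.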
The really annoying thing here is that in the best case [in a perfect world] this should be done once per cluster.
... and, in view of the node sharing pointed out above, when the benchmark can have the nodes all to itself. This sounds very much like the collectives tuning, with MCA params that let the admin or user express their view of how the best performance can be achieved.
-- Bogdan Costescu IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany Phone: +49 6221 54 8869/8240, Fax: +49 6221 54 8868/8850 E-mail: bogdan.coste...@iwr.uni-heidelberg.de