Hi John,

> (...)
> I fully second Jed. Computational scientists are already fighting
> with getting scalable performance on a 'standard' multi-core
> architecture, so I doubt that one can really obtain a gain on an
> accelerator-architecture for any real-world application just by
> recompilation of existing code. Also, add the extra issue of
> PCI-Express latency.
>
>
> Two key points here:
>
> 1) the application will have to be threaded to get good performance on
> the Xeon Phi. I know that PETSc is moving in this direction. My thought
> was that you would have 1 MPI process on the card and 1 on each CPU and
> use threads.

I'd be happy if it were that simple, but I doubt this. Even Intel is
saying that the Xeon Phi is an accelerator architecture rather than a
multi-core architecture.

> 2) The recompilation is needed to run in "Native mode". This is not an
> offloaded computation in the GPU sense. The entire program runs on the
> card. All the memory is local. You run one binary on the card, a
> different binary on the CPU. The only thing that has to cross the bus
> is MPI communication, which should be faster than even the fastest
> network cards because it only has to cross the bus.

Hmm, that could indeed get past the latency issue to a large extent.
Probably some OS-functionality is not available on the Xeon Phi, thus
some redesigning would still be required. Let's see...

Best regards,
Karli
