Hi Matt,

> Karl, I am assuming that the places in the article where the Phi beats
> the K20 are for denser matrices
> where they have explicitly vectorized?

a quick check with the matrices in the paper showed that it is indeed the matrices with a higher number of nonzeros per row for which the Xeon Phi offers higher performance than the K20 (correlation, not causation). Reordering the dofs still has a sizable impact (and I think the reordering algorithms could also be modified to better suit accelerators/GPUs), but overall I support your observation.

The CSR format used in the paper is not necessarily optimal for MIC and GPUs, but that's a different story...

Best regards,
Karli
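
For reference, the "nonzeros per row" figure discussed above can be read directly off the row-pointer array of a CSR matrix. The following is only a minimal sketch in plain Python; the function names and the small `row_ptr` example are made up for illustration and are not taken from the paper.

```python
def nnz_per_row(row_ptr):
    """Number of stored nonzeros in each row of a CSR matrix.

    row_ptr is the CSR row-pointer array: row i occupies the entries
    row_ptr[i] .. row_ptr[i+1]-1 of the column-index/value arrays.
    """
    return [row_ptr[i + 1] - row_ptr[i] for i in range(len(row_ptr) - 1)]


def mean_nnz_per_row(row_ptr):
    """Average number of stored nonzeros per row (0.0 for an empty matrix)."""
    counts = nnz_per_row(row_ptr)
    return sum(counts) / len(counts) if counts else 0.0


# Hypothetical 4x4 matrix whose rows hold 2, 1, 3 and 2 nonzeros.
row_ptr = [0, 2, 3, 6, 8]
print(nnz_per_row(row_ptr), mean_nnz_per_row(row_ptr))
```

Comparing this average across the test matrices is one simple way to check whether the denser ones are those where one device pulls ahead.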
