Hi Andrea,
In fact, I have another major problem: when running on multiple GPUs with PETSc, my results are completely inconsistent with the single-GPU results.
This was a bug which was fixed a couple of days ago. The fix is in the 'next' branch, but not yet merged to master since that branch has another valgrind issue I haven't nailed down yet.
In my code, for now, I'm assuming a one-to-one correspondence between CPU cores and GPUs: I run on 8 cores and 8 GPUs (4 K10 cards). How can I enforce this in the PETSc solver? Is it handled automatically, or do I have to specify some options?
One MPI rank maps to one logical GPU. In your case, please run with 8 MPI ranks and distribute them equally over the nodes equipped with the GPUs.
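In case it helps, here is a minimal sketch of the general one-rank-per-GPU binding pattern, using MPI-3 shared-memory communicators and the CUDA runtime. This is not PETSc's internal logic, just an illustration of the idea, and the whole program is hypothetical:

  /* Minimal sketch: bind each MPI rank to one GPU on its node.
   * Assumes ranks are spread evenly over the GPU-equipped nodes. */
  #include <mpi.h>
  #include <cuda_runtime.h>

  int main(int argc, char **argv)
  {
    int local_rank, num_devices;
    MPI_Comm local_comm;

    MPI_Init(&argc, &argv);

    /* Node-local rank via an MPI-3 shared-memory split. */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &local_comm);
    MPI_Comm_rank(local_comm, &local_rank);

    /* One rank <-> one logical GPU. */
    cudaGetDeviceCount(&num_devices);
    cudaSetDevice(local_rank % num_devices);

    /* ... solver setup goes here ... */

    MPI_Comm_free(&local_comm);
    MPI_Finalize();
    return 0;
  }

With 4 K10 cards (8 logical GPUs) you would then launch with 8 ranks, e.g. "mpirun -np 8" with four ranks per node; the exact placement flags depend on your MPI implementation.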
As for the preconditioners: we haven't added any new preconditioners recently. Preconditioning on GPUs is a very problem-specific thing due to the burden of PCI-Express latency. Massively parallel approaches such as Sparse Approximate Inverses perform well in terms of theoretical FLOP counts, but are poor in terms of convergence and pretty expensive in terms of memory when running many simultaneous factorizations. ILU on the GPU can be fast if you order the unknowns properly and have only a few nonzeros per row, but it is not great in terms of convergence rate either. PCI-Express bandwidth and latency are really a problem here...
How large are your blocks when using a block-Jacobi preconditioner for your problem? On the order of 3x3, or (much) larger?
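If you end up experimenting with block-Jacobi, it is usually selected roughly like this (just a sketch; 'ksp' is assumed to be your already-created solver object and the helper name is purely illustrative):

  #include <petscksp.h>

  /* Sketch: switch an existing KSP to block-Jacobi; each block then uses
   * PETSc's default sub-solver (ILU(0)) unless overridden via options. */
  PetscErrorCode use_block_jacobi(KSP ksp)
  {
    PC pc;
    PetscErrorCode ierr;

    ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
    ierr = PCSetType(pc, PCBJACOBI);CHKERRQ(ierr);

    /* Equivalent run-time options:
     *   -pc_type bjacobi -sub_ksp_type preonly -sub_pc_type ilu */
    ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
    return 0;
  }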
Best regards, Karli
