Well, I have 9 equations, so 9x9 I guess... I hope the bug you mention was a major one, because what I get is seriously wrong: on a single GPU (KSPGMRES + PCASM) I get a residual of +0.72, while on 8 cores/GPUs I get -1.00 at the first time step, just to give an example. Could this be due to the bug you mention, or do you suspect something else?
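
For reference, this is a minimal sketch of how I understand the 9x9 block structure and the GMRES+ASM solver would be exposed to PETSc; all names and sizes are placeholders rather than my actual code, so please correct me if this is not what you had in mind:

#include <petscksp.h>

/* Sketch only: declare the per-cell 9x9 block structure so the
 * preconditioner can work on 9x9 blocks. Sizes/names are placeholders. */
int main(int argc, char **argv)
{
  Mat      A;
  KSP      ksp;
  PC       pc;
  PetscInt nLocalCells = 1000;           /* placeholder local cell count */
  PetscInt nEq         = 9;              /* 9 coupled equations per cell */
  PetscInt nLocal      = nLocalCells*nEq;

  PetscInitialize(&argc, &argv, NULL, NULL);

  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetSizes(A, nLocal, nLocal, PETSC_DETERMINE, PETSC_DETERMINE);
  MatSetFromOptions(A);                  /* matrix type from the command line */
  MatSetBlockSize(A, nEq);               /* tell PETSc about the 9x9 blocks   */
  MatSetUp(A);
  /* ... assemble the Jacobian here ... */

  KSPCreate(PETSC_COMM_WORLD, &ksp);
  KSPSetType(ksp, KSPGMRES);
  KSPGetPC(ksp, &pc);
  PCSetType(pc, PCASM);                  /* what I currently use; PCBJACOBI
                                            would be the block-Jacobi variant */
  KSPSetFromOptions(ksp);
  /* ... KSPSetOperators + KSPSolve as usual ... */

  KSPDestroy(&ksp);
  MatDestroy(&A);
  PetscFinalize();
  return 0;
}

So, if I understand your question correctly, the small dense blocks coming from the coupled equations would be 9x9, one per mesh cell.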
What should I do then? Wait for the valgrind fix which is underway and then update? Could you please notify me when it is fixed? I'm writing a final report for a project and I would like to include this feature fully working if possible.

Another question: what exactly do you mean by "order the unknowns properly" in this case?

Regarding the 1-1 CPU/GPU correspondence, I have appended below the quoted message a small sketch of how I was thinking of pinning ranks to devices on my side, so you can tell me whether it would conflict with PETSc's own mapping.

Thanks a lot!

Andrea

On Fri, Jan 17, 2014 at 10:02 PM, Karl Rupp <[email protected]> wrote:

> Hi Andrea,
>
>> In fact, I have another major problem: when running on multi-GPU with
>> PETSc my results are totally inconsistent compared to a single GPU.
>
> This was a bug which was fixed a couple of days ago. It is in branch
> 'next', but not yet merged to 'master' since it has another valgrind
> issue I haven't nailed down yet.
>
>> In my code, for now, I'm assuming a 1-1 correspondence between CPU and
>> GPU: I run on 8 cores and 8 GPUs (4 K10). How can I enforce this in the
>> PETSc solver? Is it automatically done or do I have to specify some
>> options?
>
> One MPI rank maps to one logical GPU. In your case, please run with 8 MPI
> ranks and distribute them equally over the nodes equipped with the GPUs.
>
> As for the preconditioners: we haven't added any new preconditioners
> recently. Preconditioning on GPUs is a very problem-specific thing due to
> the burden of PCI-Express latency. Massively parallel approaches such as
> sparse approximate inverses perform well in terms of theoretical FLOP
> counts, but are poor in terms of convergence and pretty expensive in terms
> of memory when running many simultaneous factorizations. ILU on the GPU
> can be fast if you order the unknowns properly and have only a few
> nonzeros per row, but it is not great in terms of convergence rate either.
> PCI-Express bandwidth and latency are really a problem here...
>
> How large are your blocks when using a block-Jacobi preconditioner for
> your problem? On the order of 3x3 or (much) larger?
>
> Best regards,
> Karli

--
Dr. Andrea Lani
Senior Research Engineer, PhD
Aeronautics & Aerospace dept., CFD group
Von Karman Institute for Fluid Dynamics
Chaussée de Waterloo 72, B-1640, Rhode-Saint-Genese, Belgium
fax:  +32-2-3599600
work: +32-2-3599769
[email protected]
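
P.S. Here is the sketch I mentioned above: the rank-to-device pinning I was considering doing on my side (after MPI_Init but before PetscInitialize). The MPI-3 shared-memory split and the modulo mapping are my own assumptions about how to get one logical GPU per rank, not anything taken from PETSc; if PETSc already does the equivalent internally, I will simply drop this.

#include <mpi.h>
#include <cuda_runtime.h>

/* Sketch only: pin each MPI rank to one logical GPU on its node.
 * Assumes as many ranks per node as logical GPUs per node. */
void pin_rank_to_gpu(void)
{
  MPI_Comm localComm;
  int      localRank, nDevices;

  /* group the ranks that share a node (MPI-3) */
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                      MPI_INFO_NULL, &localComm);
  MPI_Comm_rank(localComm, &localRank);

  cudaGetDeviceCount(&nDevices);
  cudaSetDevice(localRank % nDevices);  /* 1-1 mapping when ranks == GPUs */

  MPI_Comm_free(&localComm);
}

With 8 ranks spread evenly over the nodes hosting the K10 boards (each K10 showing up as two logical devices), this should give exactly the 1-1 correspondence I have in my code, if I read your "one MPI rank maps to one logical GPU" correctly.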
