Thanks a lot, Karli! I will update, run a few tests and let you know if my problem is fixed! Best regards
Andrea

On Jan 18, 2014, at 10:26 AM, Karl Rupp <[email protected]> wrote:

> Hi Andrea,
>
> the fix is now merged to master:
> https://bitbucket.org/petsc/petsc/commits/087a195f1d07b315894e9d8ab1801a0ce993221c
>
> Best regards,
> Karli
>
>
> On 01/17/2014 10:13 PM, Andrea Lani wrote:
>> Well, I have 9 equations, so 9x9 I guess...
>>
>> I hope the one you are mentioning was a major bug, because what I get is
>> seriously wrong: while on a single GPU (KSPGMRES+PCASM) I get a residual
>> of +0.72, on 8 cores/GPUs I get -1.00 at the first time step, just to
>> give an example. Can this be due to the bug you mention, or do you
>> suspect something more?
>>
>> What should I do then? Wait for the valgrind fix which is underway and
>> then update? Can you please notify me when this is fixed? I'm writing a
>> final report for a project and I would like to include this feature
>> fully fixed if possible.
>>
>> Another question: what exactly do you mean by "order the unknowns
>> properly" in this case?
>> Thanks a lot!
>>
>> Andrea
>>
>>
>> On Fri, Jan 17, 2014 at 10:02 PM, Karl Rupp <[email protected]> wrote:
>>
>>     Hi Andrea,
>>
>>         In fact, I have another major problem: when running on multi-GPU
>>         with PETSc my results are totally inconsistent compared to a
>>         single GPU.
>>
>>     This was a bug which was fixed a couple of days ago. It is in branch
>>     'next', but not yet merged to master since it has another valgrind
>>     issue I haven't nailed down yet.
>>
>>         In my code, for now, I'm assuming a 1-1 correspondence between
>>         CPU and GPU: I run on 8 cores and 8 GPUs (4 K10). How can I
>>         enforce this in the PETSc solver? Is it automatically done or do
>>         I have to specify some options?
>>
>>     One MPI rank maps to one logical GPU. In your case, please run with
>>     8 MPI ranks and distribute them equally over the nodes equipped with
>>     the GPUs.
>>
>>     As for the preconditioners: we haven't added any new preconditioners
>>     recently. Preconditioning on GPUs is a very problem-specific thing
>>     due to the burden of PCI-Express latency. Massively parallel
>>     approaches such as sparse approximate inverses perform well in terms
>>     of theoretical FLOP counts, but are poor in terms of convergence and
>>     pretty expensive in terms of memory when running many simultaneous
>>     factorizations. ILU on the GPU can be fast if you order the unknowns
>>     properly and have only a few nonzeros per row, but it is not great
>>     in terms of convergence rate either. PCI-Express bandwidth and
>>     latency are really a problem here...
>>
>>     How large are your blocks when using a block-Jacobi preconditioner
>>     for your problem? On the order of 3x3 or (much) larger?
>>
>>     Best regards,
>>     Karli
>>
>>
>> --
>> Dr. Andrea Lani
>> Senior Research Engineer, PhD
>> Aeronautics & Aerospace dept., CFD group
>> Von Karman Institute for Fluid Dynamics
>> Chaussee de Waterloo 72,
>> B-1640, Rhode-Saint-Genese, Belgium
>> fax: +32-2-3599600
>> work: +32-2-3599769
>> [email protected]
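[Editor's note: the advice above (one MPI rank per GPU, GMRES with an ASM or block-Jacobi preconditioner) can be sketched as a launch command. This is a hypothetical example, not from the thread: the executable name `./solver`, the hostfile `hosts.txt`, and the node layout (2 nodes, 4 GPUs each) are placeholders; the `--map-by` syntax is Open MPI-specific, and the GPU vector/matrix types depend on how PETSc was built.]

```sh
# 8 MPI ranks, 4 per node, so each rank can bind to one of the 4 GPUs
# on its node (Open MPI mapping syntax; other MPIs differ).
# -ksp_type gmres / -pc_type asm matches the single-GPU setup mentioned
# in the thread; -sub_pc_type ilu selects the solver on each subdomain.
# -ksp_monitor_true_residual prints residuals, useful for comparing the
# single-GPU and multi-GPU runs discussed above.
mpiexec -n 8 --hostfile hosts.txt --map-by ppr:4:node ./solver \
    -ksp_type gmres \
    -pc_type asm \
    -sub_pc_type ilu \
    -ksp_monitor_true_residual
```

For a CUDA-enabled PETSc build, adding options such as `-vec_type cuda -mat_type aijcusparse` (names vary across PETSc versions) moves the vector and matrix operations onto the GPU without code changes.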
