Hi Andrea,

The fix is now merged to master:
https://bitbucket.org/petsc/petsc/commits/087a195f1d07b315894e9d8ab1801a0ce993221c

Best regards,
Karli



On 01/17/2014 10:13 PM, Andrea Lani wrote:
Well, I have 9 equations, so 9x9 I guess...

I hope the bug you are mentioning was a major one, because what I get is
seriously wrong: on a single GPU (KSPGMRES+PCASM) I get a residual of
+0.72, while on 8 cores/GPUs I get -1.00 at the first time step, to give
an example. Can this be due to the bug you mention, or do you suspect
something more?

What should I do then? Wait for the valgrind fix that is underway and
then update? Can you please notify me when this is fixed? I'm writing a
final report for a project and would like to include this feature, fully
fixed, if possible.

Another question: what exactly do you mean by "order the unknowns
properly" in this case?
Thanks a lot!

Andrea


On Fri, Jan 17, 2014 at 10:02 PM, Karl Rupp <[email protected]> wrote:

    Hi Andrea,


        In fact, I have another major problem: when running on multi-GPU
        with PETSc, my results are totally inconsistent compared to a
        single GPU.


    This was caused by a bug that was fixed a couple of days ago. The
    fix is in branch 'next', but not yet merged to master, since it
    triggers another valgrind issue I haven't nailed down yet.



        In my code, for now, I'm assuming a one-to-one correspondence
        between CPU and GPU: I run on 8 cores and 8 GPUs (4 K10). How
        can I enforce this in the PETSc solver? Is it done
        automatically, or do I have to specify some options?


    One MPI rank maps to one logical GPU. In your case, please run with
    8 MPI ranks and distribute them equally over the nodes equipped with
    the GPUs.
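
    Conceptually, the assignment works like the following MPI+CUDA
    sketch (just an illustration, assuming ranks are placed
    consecutively on nodes that all see the same number of GPUs; the
    PETSc GPU backends handle the device selection internally, so you
    don't need such code in your application):

        #include <mpi.h>
        #include <cuda_runtime.h>
        #include <stdio.h>

        int main(int argc, char **argv)
        {
          int rank, ndev;
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          cudaGetDeviceCount(&ndev);   /* GPUs visible on this node */
          cudaSetDevice(rank % ndev);  /* one MPI rank <-> one logical GPU */
          printf("rank %d -> GPU %d of %d\n", rank, rank % ndev, ndev);
          MPI_Finalize();
          return 0;
        }

    With your 4 K10 cards (8 logical GPUs), 'mpiexec -np 8 ./app' then
    gives each rank its own device.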

    As for the preconditioners: We haven't added any new preconditioners
    recently. Preconditioning on GPUs is a very problem-specific thing
    due to the burden of PCI-Express latency. Massively parallel
    approaches such as Sparse Approximate Inverses perform well in terms
    of theoretical FLOP counts, but are poor in terms of convergence and
    pretty expensive in terms of memory when running many simultaneous
    factorizations. ILU on the GPU can be fast if you order the
    unknowns properly and have only a few nonzeros per row, but it is
    not great in terms of convergence rate either. PCI-Express
    bandwidth and latency are really a problem here...
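
    For instance, a bandwidth-reducing ordering such as reverse
    Cuthill-McKee can be requested for the factorization; a minimal
    sketch (error checking omitted):

        #include <petscksp.h>

        int main(int argc, char **argv)
        {
          KSP ksp;
          PC  pc;
          PetscInitialize(&argc, &argv, NULL, NULL);
          KSPCreate(PETSC_COMM_WORLD, &ksp);
          KSPGetPC(ksp, &pc);
          PCSetType(pc, PCILU);
          PCFactorSetMatOrderingType(pc, MATORDERINGRCM); /* reverse Cuthill-McKee */
          /* ... KSPSetOperators() and KSPSolve() would go here ... */
          KSPDestroy(&ksp);
          PetscFinalize();
          return 0;
        }

    The same can be selected at run time via
    -pc_factor_mat_ordering_type rcm.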

    How large are your blocks when using a block-Jacobi preconditioner
    for your problem? On the order of 3x3, or (much) larger?
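
    If they are small point blocks, e.g. 9x9 for 9 coupled equations
    per grid point, you can expose the block structure to the
    preconditioner (a sketch only, assuming point-block Jacobi via
    PCPBJACOBI is what you want; the helper is hypothetical and error
    checking is omitted):

        #include <petscksp.h>

        /* hypothetical helper: 9x9 point blocks + point-block Jacobi */
        void setup_pbjacobi(Mat A, KSP ksp)
        {
          PC pc;
          MatSetBlockSize(A, 9);      /* call before assembling A */
          KSPGetPC(ksp, &pc);
          PCSetType(pc, PCPBJACOBI);  /* invert each 9x9 diagonal block */
        }

    At run time this corresponds to -pc_type pbjacobi.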

    Best regards,
    Karli




--
Dr. Andrea Lani
Senior Research Engineer, PhD
Aeronautics & Aerospace dept., CFD group
Von Karman Institute for Fluid Dynamics
Chaussée de Waterloo 72,
B-1640, Rhode-Saint-Genese,  Belgium
fax  : +32-2-3599600
work : +32-2-3599769
[email protected]
