Hi Robert

>> 2) What is the rough DoF estimate for the strong scaling limit you observed
>>    with PyFR?
> 
> That is a good question and one which I do not have much of a feel for. I
> would say that you want on the order of a thousand *elements* per GPU, and
> below that you may begin to experience a flattening out of your strong
> scaling curve.

Based on our Gordon Bell results on the K20X, things start to tail off for 3D 
compressible Navier-Stokes when you get down to ~1 element per CUDA core, where 
an element here is a P4 hexahedron with 5 x 5 x 5 solution points, each 
carrying 5 conserved variables, i.e. 625 DoFs per element.
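For a rough back-of-the-envelope conversion (taking the 2688 CUDA cores of a 
K20X as the per-GPU core count; exact numbers will of course vary with the 
card and the case), something along these lines:

    # Rough estimate of the per-GPU size at which strong scaling tails off,
    # assuming ~1 element per CUDA core as above.
    cores_per_gpu = 2688              # CUDA cores on a Tesla K20X
    order = 4                         # P4 elements
    nvars = 5                         # conserved variables, 3D compressible NS

    pts_per_elem = (order + 1)**3         # 125 solution points per hexahedron
    dof_per_elem = pts_per_elem * nvars   # 625 DoFs per element

    elems_per_gpu = cores_per_gpu         # ~1 element per CUDA core
    dof_per_gpu = elems_per_gpu * dof_per_elem

    print(f'~{elems_per_gpu} elements, ~{dof_per_gpu / 1e6:.2f}M DoFs per GPU')
    # -> ~2688 elements, ~1.68M DoFs per GPU

So, very roughly, somewhere around 1.5-2 million DoFs per K20X before the 
strong scaling curve starts to flatten.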

Cheers

Peter

Dr Peter Vincent MSci ARCS DIC PhD
Reader in Aeronautics and EPSRC Fellow
Department of Aeronautics
Imperial College London
South Kensington
London
SW7 2AZ
UK

web: www.imperial.ac.uk/aeronautics/research/vincentlab
twitter: @Vincent_Lab

> On 3 Mar 2017, at 02:26, Freddie Witherden <[email protected]> wrote:
> 
> Hi Robert,
> 
> On 02/03/2017 21:12, Robert Sawko wrote:
>> Great... Yes, I didn't estimate the degrees of freedom for this; I was trying
>> to be too quick. I've uploaded the sd7003 case. I added the residual printing
>> and I can already see a 2x speedup going from 1 to 2 nodes. I am doing a full
>> test now.
>> 
>> 
>> I have several related questions and comments:
>> 1) What is [backend] rank-allocator = linear? Does this not conflict with MPI
>>    options, e.g. -rank-by from OMPI or the binding policy of MVAPICH? This is
>>    significant for me as I have two GPUs per socket and 64 hardware threads
>>    per socket. I don't want four processes to run on the first socket alone.
> 
> So the rank allocator decides how partitions in the mesh should be mapped 
> onto MPI ranks.  The linear allocator is exactly what you would expect: the 
> first MPI rank gets the first partition, and so on and so forth.  There is 
> also a random allocator.  Having four processes on one socket is probably 
> okay; I doubt you would notice much of a difference compared to an even 
> split.  When running with the CUDA backend, PyFR is entirely single threaded 
> and offloads all relevant computation to the GPU.  We also work hard to mask 
> latencies for host-to-device transfers, so even sub-optimal assignments 
> usually work out.
> 
>>    I print my bindings in MVAPICH and it looks OK, but I want to double-check
>>    that Python is not doing something else under the hood.
>> 
>> 2) What is the rough DoF estimate for the strong scaling limit you observed
>>    with PyFR?
> 
> That is a good question and one which I do not have much of a feel for. I
> would say that you want on the order of a thousand *elements* per GPU, and
> below that you may begin to experience a flattening out of your strong
> scaling curve.
> 
>> 3) At the moment I am setting 4 MPI processes per node as I've got 4 GPUs, but
>>    I assume there's nothing to stop me from using more. Has anyone looked at
>>    the optimal ratio of MPI processes to GPUs?
> 
> One MPI rank per GPU will be optimal.  Anything else will just introduce 
> extra overheads.
> 
> Regards, Freddie.
> 

