On 05/03/15 15:58, Pablo Restrepo Henao wrote:
> One of the applications we have to run for the competition is Pyfr, so i
> was wondering if you maybe have some example datasets, beside the two
> that come with PyFr itself, so we can test performance in different
> hardware options.

I would certainly recommend avoiding the two provided test cases for
benchmarking purposes.  Both are two dimensional and contain relatively
few elements, so run-times are dominated by overhead from the Python
layer, inefficiencies in BLAS, and memory bandwidth.

3D test cases are far more suitable and realistic.  I am sure someone on
the mailing list will post some.

> In addition, I was wondering if you have any advice regarding to use
> CPUS, GPUS or (if possible) a combination of both for getting a better
> performance with pyFR.

The CUDA backend should perform reasonably well out of the box.  Very
little configuration should be required on the software side of things.
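For example, a single-GPU run is simply a matter of selecting the
backend on the command line (the mesh and config names here are just
placeholders):

  pyfr-sim -b cuda -p run mymesh.pyfrm mycfg.ini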

The OpenCL backend is a little more involved.  The underlying clBLAS
library supports auto-tuning, and this can result in improved
performance on some devices.  There is some documentation on the clBLAS
GitHub page on how to go about tuning it.
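Roughly speaking (check the clBLAS documentation for the exact tool
name) you run the clBLAS tuning utility once per device and then point
clBLAS at the resulting kernel database before launching PyFR:

  export CLBLAS_STORAGE_PATH=/path/to/tuned/kernels
  pyfr-sim -b opencl -p run mymesh.pyfrm mycfg.ini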

Finally, there is the C/OpenMP backend.  There are a variety of options
here.  Firstly, you will want to run with one MPI rank per NUMA zone
(usually a socket).  So use pyfr-mesh to partition the mesh into as many
pieces as there are sockets.  Be sure that the MPI and OpenMP libraries
cooperate, correctly binding processes and threads to sockets and cores.
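With Open MPI on a two-socket node this might look something like the
following (the binding flags differ between MPI implementations and the
thread count is purely illustrative):

  export OMP_NUM_THREADS=8   # cores per socket
  mpirun -np 2 --map-by socket --bind-to socket \
      pyfr-sim -b openmp -p run mymesh.pyfrm mycfg.ini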

PyFR is capable of using either a serial or a parallel CBLAS library.
If a serial BLAS library is specified then PyFR will perform its own
parallelisation using OpenMP.  This almost always outperforms the
multi-threading done by the BLAS libraries themselves.

In my experience ATLAS (serial) tends to perform best, followed by MKL
(serial) and OpenBLAS (serial).  The trick to using MKL is to specify it as:

  [backend-openmp]
  cblas-st = libmkl_rt.so ; if letting PyFR do the threading
  cblas-mt = libmkl_rt.so ; if letting MKL do the threading

To stop MKL from performing its own threading (allowing PyFR to do the
threading) you can export MKL_NUM_THREADS=1.
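So a serial-MKL run where PyFR handles the threading would be along the
lines of:

  export MKL_NUM_THREADS=1
  pyfr-sim -b openmp -p run mymesh.pyfrm mycfg.ini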

The choice of compiler for the OpenMP backend does not normally make a
huge amount of difference, although it can be specified in the
[backend-openmp] section via

  cc = my_compiler_command
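Putting the pieces together, a [backend-openmp] section using a serial
ATLAS CBLAS might look like the following (the compiler and library
names are illustrative and depend on your system):

  [backend-openmp]
  cc = gcc
  cblas-st = libcblas.so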

PyFR is capable of running heterogeneously.  However, the domain
decomposition must be done manually.  The idea is to run one MPI rank
per GPU/OpenCL device/NUMA zone.  Start by partitioning the mesh with a
suitable set of weighting factors (these need to be determined
empirically).  So if we had a single CUDA GPU and two CPUs we might do:

  pyfr-mesh partition 10:2:2 mymesh.pyfrm .

with the first partition being given a weight of 10, and the remaining
two a weight of 2.  To run the simulation do:

  mpirun -np 3 ./launcher.sh mymesh.pyfrm mycfg.ini

where launcher.sh is something along the lines of:

#!/bin/bash

# Pick a backend based on our node-local MPI rank
case ${MV2_COMM_WORLD_LOCAL_RANK} in
    "0" )
        BACKEND="cuda" ;;
    "1" )
        BACKEND="openmp" ;;
    "2" )
        BACKEND="openmp" ;;
esac

pyfr-sim -b${BACKEND} -p run $1 $2

The above uses the ${MV2_COMM_WORLD_LOCAL_RANK} environment variable
to get our node-local MPI rank before MPI_Init has been called.  This is
MVAPICH2-specific, although Open MPI provides a similar variable.  The
first rank then picks up the CUDA backend while the remaining two get
the OpenMP backend.
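For reference, an Open MPI version of launcher.sh is identical apart
from the name of the variable; a minimal sketch:

#!/bin/bash

# Open MPI exposes the node-local rank via OMPI_COMM_WORLD_LOCAL_RANK
case ${OMPI_COMM_WORLD_LOCAL_RANK} in
    "0" )
        BACKEND="cuda" ;;
    * )
        BACKEND="openmp" ;;
esac

pyfr-sim -b${BACKEND} -p run $1 $2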

Hope this helps.

Regards, Freddie.

