Thank you very much. 

This is the information I was looking for.

On Thursday, 5 March 2015 at 18:55:48 (UTC-5), Freddie Witherden wrote:
> On 05/03/15 15:58, Pablo Restrepo Henao wrote: 
> > One of the applications we have to run for the competition is PyFR, so I 
> > was wondering if you have some example datasets, besides the two 
> > that come with PyFR itself, so we can test performance on different 
> > hardware options. 
> I would certainly recommend avoiding the two provided test cases for 
> benchmarking purposes.  Both are two-dimensional and contain relatively 
> few elements.  As a result, run-times are dominated by overhead from the 
> Python layer, inefficiencies in BLAS, and memory bandwidth. 
> 3D test cases are far more suitable and realistic.  I am sure someone on 
> the mailing list will post some. 
> > In addition, I was wondering if you have any advice on whether to use 
> > CPUs, GPUs or (if possible) a combination of both to get better 
> > performance with PyFR. 
> The CUDA backend should perform reasonably well out of the box.  Very 
> little configuration should be required on the software side of things. 
> The OpenCL backend is a little bit more complicated.  The underlying 
> clBLAS library supports auto-tuning and this can result in improved 
> performance on some devices.  There is some documentation on the clBLAS 
> github page on how to go about tuning clBLAS. 
> Finally, there is the C/OpenMP backend.  There are a variety of options 
> here.  Firstly, you will want to run with one MPI rank per NUMA zone 
> (usually a socket).  So use pyfr-mesh to partition the mesh into as many 
> pieces as there are sockets.  Be sure that the MPI/OpenMP libraries are 
> getting along and correctly binding threads and processes to cores and 
> processors. 
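
As a rough illustration of the "one partition per socket" advice, here is a minimal sketch. It assumes a Linux machine with lscpu from util-linux; count_sockets is a hypothetical helper, mymesh.pyfrm is a placeholder mesh name, and the script only prints the partition command rather than running it. Note that it counts sockets, which is usually but not always the same as the number of NUMA zones.

```shell
#!/bin/bash
# Hypothetical helper: count distinct socket IDs in the parseable
# output of `lscpu -p=SOCKET` (comment lines start with '#').
count_sockets() {
    grep -v '^#' | sort -u | wc -l | tr -d ' '
}

# Only attempt detection if lscpu is available on this machine.
if command -v lscpu >/dev/null 2>&1; then
    nsock=$(lscpu -p=SOCKET 2>/dev/null | count_sockets)
    # Print (do not run) the partition command: one piece per socket.
    echo "pyfr-mesh partition ${nsock} mymesh.pyfrm ."
fi
```

On a dual-socket node this would be expected to suggest two partitions; the binding of ranks and threads to cores still has to be checked separately with your MPI and OpenMP runtime.
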
> PyFR is capable of using either a serial or parallel cblas library.  If 
> a serial BLAS library is specified then PyFR will perform its own 
> parallelisation using OpenMP.  This almost always outperforms the 
> multi-threading done by the BLAS libraries themselves. 
> In my experience ATLAS (serial) tends to perform best, followed by MKL 
> (serial) and OpenBLAS (serial).  The trick to using MKL is to specify it 
> as: 
>   [backend-openmp] 
>   cblas-st = ; if letting PyFR do the threading 
>   cblas-mt = ; if letting MKL do the threading 
> To stop MKL from performing its own threading (allowing PyFR to do the 
> threading) you can export MKL_NUM_THREADS=1. 
> The choice of compiler for the OpenMP backend does not normally make a 
> huge amount of difference, although it can be specified via 
>  cc = my_compiler_command 
> PyFR is capable of running heterogeneously.  However, the domain 
> decomposition must be done manually.  The idea is to run one MPI rank 
> per GPU/OpenCL device/NUMA zone.  Start by partitioning the mesh with a 
> suitable set of weighting factors (these need to be determined 
> empirically).  So if we had a single CUDA GPU and two CPUs we might do: 
>   pyfr-mesh partition 10:2:2 mymesh.pyfrm . 
> with the first partition being given a weight of 10, and the remaining 
> two a weight of 2.  To run the simulation do: 
>   mpirun -np 3 ./ mymesh.pyfrm mycfg.ini 
> where the wrapper script is something along the lines of: 
> #!/bin/bash 
> # Select a backend from the node-local MPI rank 
> case ${MV2_COMM_WORLD_LOCAL_RANK} in 
>     "0" ) 
>         BACKEND="cuda" ;; 
>     "1" ) 
>         BACKEND="openmp" ;; 
>     "2" ) 
>         BACKEND="openmp" ;; 
> esac 
> pyfr-sim -b${BACKEND} -p run $1 $2 
> The above uses the ${MV2_COMM_WORLD_LOCAL_RANK} environment variable 
> to get our node local MPI rank before MPI_Init has been called.  This is 
> MVAPICH2 specific although OpenMPI provides a similar variable.  The 
> first rank then picks up the CUDA backend while the remaining two get 
> the OpenMP backend. 
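
The heterogeneous recipe above can be sketched in a slightly more general form. This is only an illustration under stated assumptions: pick_backend, local_rank and make_weights are hypothetical helper names, the fallback to rank 0 when no variable is set is my own choice, and OMPI_COMM_WORLD_LOCAL_RANK is the Open MPI counterpart of the MVAPICH2 variable. The weights 10 and 2 are just the example values from the message and still need to be determined empirically.

```shell
#!/bin/bash
# Hypothetical helper: node-local rank 0 gets the GPU, all others the CPU.
pick_backend() {
    case "$1" in
        0) echo "cuda" ;;
        *) echo "openmp" ;;
    esac
}

# Hypothetical helper: read the node-local rank from MVAPICH2 or,
# failing that, Open MPI. Defaults to 0 (an assumption) if neither is set.
local_rank() {
    echo "${MV2_COMM_WORLD_LOCAL_RANK:-${OMPI_COMM_WORLD_LOCAL_RANK:-0}}"
}

# Hypothetical helper: build a weight string for pyfr-mesh partition.
# Arguments: <num GPUs> <num CPUs> <GPU weight> <CPU weight>
make_weights() (
    out=""
    i=0
    while [ "$i" -lt "$1" ]; do
        out="${out:+$out:}$3"
        i=$((i + 1))
    done
    i=0
    while [ "$i" -lt "$2" ]; do
        out="${out:+$out:}$4"
        i=$((i + 1))
    done
    echo "$out"
)

# When invoked as a wrapper with <mesh> <config>, launch the simulation.
if [ "$#" -eq 2 ]; then
    BACKEND=$(pick_backend "$(local_rank)")
    exec pyfr-sim -b"${BACKEND}" -p run "$1" "$2"
fi
```

For one GPU and two CPUs, make_weights 1 2 10 2 yields the 10:2:2 string used in the partition command, with the GPU weight first so that it lines up with local rank 0.
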
> Hope this helps. 
> Regards, Freddie. 

