Hi,

On 28/06/15 15:09, CatDog wrote:
> 1. CUDA is not applicable because of memory limit, is it possible to 
> circumvent this problem? I have 256 GB ram for cpu.

No.  In practice this is rarely a problem because real-world simulations
are almost always compute bound rather than memory bound.  As a point of
reference, if you fully load an NVIDIA K40c (12 GiB of memory) with a
simulation, you will probably need to run it for three weeks or more to
get any reasonable statistics out of it.

> 2. How to interpret the OPENMP results? what is the difference 
> between parallel and serial.

The OpenMP results depend heavily on the configuration of your system
and what BLAS library you're using.  A key point is that OpenMP only
performs well inside of a single NUMA zone.

For instance, if you have 64 AMD cores in a single system then you
probably have four sockets each with a 16 core CPU.  Each of these CPUs
will have two NUMA zones for a total of eight NUMA zones.  Therefore,
the optimal configuration is to partition the mesh into eight pieces and
run each piece with four threads.  Care is necessary to ensure that
these threads are 'pinned' to the correct cores.  Getting this right
when using a combination of MPI + OpenMP on a single system can
sometimes be painful.
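
As a rough illustration of the layout described above, here is a minimal
Python sketch that works out which cores each piece (one MPI rank per
NUMA zone) could be pinned to.  The function name and the assumption
that core IDs are numbered contiguously within each zone are mine and
purely hypothetical; check the real numbering on your machine (for
example with lscpu or numactl --hardware) before pinning anything.

```python
def numa_layout(sockets, cores_per_socket, zones_per_socket, threads_per_rank):
    """Return one list of core IDs per NUMA zone (one MPI rank per zone).

    Hypothetical helper: assumes core IDs are contiguous within a zone,
    which is NOT guaranteed on real hardware.
    """
    zones = sockets * zones_per_socket
    cores_per_zone = cores_per_socket // zones_per_socket
    if threads_per_rank > cores_per_zone:
        raise ValueError("more threads requested than cores in a NUMA zone")
    layout = []
    for zone in range(zones):
        first = zone * cores_per_zone  # first core ID in this zone (assumed)
        layout.append(list(range(first, first + threads_per_rank)))
    return layout


# The example from the text: four sockets, 16 cores each, two NUMA zones
# per CPU, and four threads per mesh piece.
layout = numa_layout(sockets=4, cores_per_socket=16,
                     zones_per_socket=2, threads_per_rank=4)
for rank, cores in enumerate(layout):
    print(f"rank {rank}: pin threads to cores {cores}")
```

The number of entries in the result is the number of pieces to partition
the mesh into, and each entry is a candidate core list for that rank's
threads.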

The parallel vs serial distinction depends on whether the BLAS library
you are using is multi-threaded.  If it is multi-threaded then you'll
want to set the BLAS type to parallel, otherwise to serial.  The
recommendation is to use a single-threaded BLAS library (ATLAS works
best, followed by MKL, and then OpenBLAS) and to let PyFR handle the
parallelism rather than the BLAS library itself.
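
To make the serial/parallel choice concrete, here is a small Python
sketch that writes out an OpenMP backend section for a PyFR .ini file.
It assumes the option names cblas and cblas-type as I recall them from
the PyFR user guide of this period, so please check them against the
documentation for your version.  The library path is a placeholder and
the helper name is hypothetical.

```python
import configparser
import io


def openmp_backend_section(cblas_path, blas_is_multithreaded):
    """Render a [backend-openmp] section as text.

    The option names 'cblas' and 'cblas-type' are assumed, not verified
    against any particular PyFR release.
    """
    cfg = configparser.ConfigParser()
    cfg["backend-openmp"] = {
        "cblas": cblas_path,
        # 'parallel' only if the BLAS library spawns its own threads
        "cblas-type": "parallel" if blas_is_multithreaded else "serial",
    }
    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()


# Recommended setup from the text: a single-threaded BLAS library.
print(openmp_backend_section("/path/to/libcblas.so", blas_is_multithreaded=False))
```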


> 3. I thought MPI is favorable on cluster rather than on a single
> server. Why MPI+OPENMP seems faster than using OPENMP solely?

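In practice, a system with eight NUMA zones is basically eight separate
systems with cache coherency.  Since OpenMP only performs well inside a
single NUMA zone, running one MPI rank per zone with OpenMP threads
inside each rank tends to beat OpenMP alone, even on a single server.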

> 4. Why OPENCL seems faster than other available configuration?

This is problem and system specific.  In my experience, when tuned
correctly, the OpenMP backend should be able to outperform the OpenCL
backend at higher polynomial orders.  However, it does require more work
to configure.

Regards, Freddie.
