Hi Zach,

On 04/11/14 02:34, Zach Davis wrote:
> I’m still looking into the OpenCL backend, but I think I was finally
> able to get the OpenMP backend up and running under OS X. I’ve collected
> a few *relative* performance benchmark results that perhaps some in the
> community might be interested in. Ideally, I would like to apply this
> test to see similar results for both CUDA and OpenCL backends.
> 
> My test system was a modest 2.4 GHz (i7-3630QM) quad-core Intel Core i7
> Ivy Bridge processor with 16GB 1600 MHz DDR3 RAM running OS X v10.10. I
> modified the compiler flags in
> ${PYFR_ROOT}/pyfr/backends/openmp/compiler.py replacing the
> -march=native option with -mtune=native. This change isn’t necessary for
> the clang-omp compiler, but was necessary for the gcc-4.9 compiler I
> tried. To keep things consistent, I left that option changed across
> compilers. I took the fastest run in my test matrix (i.e. the very last
> case) and re-ran the test while reverting the -mtune=native option back
> to -march=native and observed no change in runtime.
> 
> I ran the couette_flow_2d example case using a single partition
> initiating pyfr-sim as follows:
> 
> pyfr-sim -p -n 100 -b openmp run couette_flow_2d.pyfrm couette_flow_2d.ini
> 

As Peter noted, the Couette flow case isn't great for benchmarking
purposes.  It has relatively few elements, all of which are 2D.  As a
consequence, overheads (from Python, starting/joining threads, BLAS) are
all very high relative to the runtime.  Further, on many systems the
entire problem is able to fit within the last-level cache of the CPU.
This can distort the numbers somewhat, as kernels which are usually
memory-bandwidth bound suddenly become FLOP bound.

As a starting point, ~2500 third-order hexahedral elements are usually
sufficient to amortise away any overheads, and this is reasonably
realistic in terms of the loading per CPU/GPU for a real-world
simulation.
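The ~2500-element figure can be turned into a rough degrees-of-freedom
count.  A back-of-the-envelope sketch follows; the per-element solution
point count and variable count are my assumptions rather than anything
PyFR-specific:

```python
# Back-of-envelope working-set estimate; the point and variable counts
# below are assumptions, not values read out of PyFR itself.
n_ele = 2500         # hexahedral elements
p = 3                # polynomial order
n_vars = 5           # conserved variables (3D Navier-Stokes)

n_pts = (p + 1)**3   # tensor-product solution points per hex = 64
dof = n_ele * n_pts * n_vars

print(dof)           # 800000 degrees of freedom
print(dof * 8 / 1e6) # ~6.4 MB per field in double precision
```

At that size the per-time-step work dwarfs the fixed Python and
threading overheads that dominate the Couette flow case.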

> 
> The results for the openmp backend tests follow:
>
> [snip]
> 
> The third case run shows that hyperthreading is a no-no as I’m sure
> you’re already aware. I was actually surprised that Apple’s Accelerate
> Framework was less performant than OpenBLAS, and I’ve convinced myself
> that gcc-4.9 (v4.9.2) is garbage. To install an OpenMP version of clang
> I used homebrew and this brew recipe
> <https://github.com/Homebrew/homebrew/pull/33278>. I also had to compile
> and install Intel’s OpenMP Runtime Library
> <https://www.openmprtl.org/download#stable-releases>. I downloaded
> the version listed at the top of the table (Version 20140926), unpacked
> it, and invoked make with make compiler=clang. Next, I moved the
> *.dylib and *.h files to their respective lib and include directories
> under /usr/local. Lastly, I set C_INCLUDE_PATH and CPLUS_INCLUDE_PATH
> to include /usr/local/include and DYLD_LIBRARY_PATH to include
> /usr/local/lib.

An important thing to consider when switching between cblas-st and
cblas-mt is what the underlying BLAS library is configured to do.
Passing a multi-threaded BLAS library to cblas-st will result in an
over-subscription of the cores: each of the N concurrent BLAS calls will
itself try to launch N threads.
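One way to guard against this at runtime, assuming a multi-threaded
OpenBLAS build (the environment variable name is OpenBLAS-specific):

```shell
# Force OpenBLAS to run single-threaded so only the N outer threads
# remain; otherwise N BLAS calls x N BLAS threads = N^2 threads.
export OPENBLAS_NUM_THREADS=1

# The outer OpenMP thread count can be capped the standard way:
export OMP_NUM_THREADS=4
```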

With OpenBLAS it is relatively simple to disable threading when the
library is compiled.  When compiling a multi-threaded OpenBLAS there is
a choice between using OpenMP and its own threading code.
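For reference, a sketch of the relevant OpenBLAS build options; the flag
names are taken from the OpenBLAS Makefile and are worth double-checking
against the version you have:

```shell
# Single-threaded build -- safe to pair with cblas-st:
make USE_THREAD=0

# Multi-threaded build using OpenMP rather than OpenBLAS's own
# threading code -- plays more nicely with other OpenMP regions:
make USE_OPENMP=1
```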

I am unsure if it is possible to get Accelerate not to multi-thread on
its own.

With regards to the performance of GCC 4.9: it would be interesting to
see if that carries forward when only a single thread is used (thus
eliminating OpenMP library overheads and comparing just the quality of
the produced code).
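A single-threaded comparison can be had without rebuilding anything by
capping the thread count through the standard OpenMP environment
variable, e.g.:

```shell
# Run the same case pinned to one thread; any remaining gap between
# compilers then reflects code quality rather than OpenMP overheads.
OMP_NUM_THREADS=1 pyfr-sim -p -n 100 -b openmp run couette_flow_2d.pyfrm couette_flow_2d.ini
```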

> Now something has recently changed with either pycuda under OS X or
> PyFR, because initiating a similar test using the cuda backend results
> in the following traceback: 
> 
> pyfr-sim -p -n 100 -b cuda run couette_flow_2d.pyfrm couette_flow_2d.ini
> 
> Traceback (most recent call last):
>   File "/Users/zdavis/Applications/PyFR/pyfr/scripts/pyfr-sim", line 112, in <module>
>     main()
>   File "/usr/local/lib/python2.7/site-packages/mpmath/ctx_mp.py", line 1301, in g
>     return f(*args, **kwargs)
>   File "/Users/zdavis/Applications/PyFR/pyfr/scripts/pyfr-sim", line 82, in main
>     backend = get_backend(args.backend, cfg)
>   File "/Users/zdavis/Applications/PyFR/pyfr/backends/__init__.py", line 11, in get_backend
>     return subclass_where(BaseBackend, name=name.lower())(cfg)
>   File "/Users/zdavis/Applications/PyFR/pyfr/backends/cuda/base.py", line 33, in __init__
>     from pycuda.autoinit import context
>   File "/usr/local/lib/python2.7/site-packages/pycuda/autoinit.py", line 4, in <module>
>     cuda.init()
> pycuda._driver.RuntimeError: cuInit failed: no device
> 
> I remember when first installing and running PyFR (~v0.2) this worked
> just fine using the default backend.  I’m curious what has changed.

At a Python prompt try:

  import pycuda.autoinit

and let us know what the outcome is.  I believe some people are having
issues with CUDA/PyCUDA on Mac OS 10.10.

Regards, Freddie.

-- 
You received this message because you are subscribed to the Google Groups "PyFR 
Mailing List" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send an email to [email protected].
Visit this group at http://groups.google.com/group/pyfrmailinglist.
For more options, visit https://groups.google.com/d/optout.
