Tuesday, 4 November 2014

Hi Freddie & Peter,

Thanks for your input.  Uncovering these nuances of how PyFR is meant to be 
used is exactly what makes it feel more familiar, so thanks for highlighting 
where my experiment may have gone astray.  Peter, I read the paper you 
mentioned when it was first announced, so I know what sort of comparative 
performance to expect; I was hoping to reproduce those same trends myself as 
a way of getting better acquainted with PyFR.  Your combined input and 
experience have helped in that regard.

With regard to pycuda on OS X 10.10, importing the pycuda.autoinit module 
gives me the following stack trace:

Python 2.7.8 (default, Oct 17 2014, 18:21:39) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.51)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pycuda.autoinit
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/pycuda/autoinit.py", line 4, in 
<module>
    cuda.init()
pycuda._driver.RuntimeError: cuInit failed: no device
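
In case it’s useful to anyone else who hits this, below is a minimal probe 
I’ve been using to separate a driver problem from a device problem.  It’s 
only a sketch built on PyCUDA’s driver module (nothing PyFR-specific), and it 
assumes PyCUDA itself imports cleanly:

import pycuda.driver as cuda

try:
    # cuda.init() wraps cuInit(), the call that fails in the trace above
    cuda.init()
    print 'CUDA devices found:', cuda.Device.count()
except cuda.RuntimeError as err:
    # cuInit aborts before any device can be queried
    print 'cuInit failed:', err

On this machine the probe prints the same “no device” error, which would 
suggest the driver cannot see the GT 650m at all, rather than PyCUDA being 
misconfigured.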

This is with a meager NVIDIA GT 650m (1024 MB VRAM) card, which is why I 
resorted to the simple 2D examples.  I’ve got a Tesla K5000 in a workstation 
right next to me, so perhaps I’ll use it for some more testing; still, I was 
specifically interested in setting up and running PyFR under OS X, which I 
realize is probably a very niche use case.  I was particularly interested in 
the OpenMP backend because I can imagine a model that wouldn’t fit in the 
available memory of the GPUs we currently provide, so better understanding 
the performance trade-off between a GPU cluster and a more traditional CPU 
cluster was worthwhile to me.

Best Regards,



Zach Davis
Rescale, Inc.
589 Howard Street, Ste. 2
San Francisco, CA 94105

[email protected]
P: (855) 737-2253

> On Nov 4, 2014, at 2:35 AM, Freddie Witherden <[email protected]> wrote:
> 
> Hi Zach,
> 
> On 04/11/14 02:34, Zach Davis wrote:
>> I’m still looking into the OpenCL backend, but I think I was finally
>> able to get the OpenMP backend up and running under OS X. I’ve collected
>> a few *relative* performance benchmark results that perhaps some in the
>> community might be interested in. Ideally, I would like to run this same
>> test on the CUDA and OpenCL backends to see whether similar results hold.
>> 
>> My test system was a modest 2.4 GHz (i7-3630QM) quad-core Intel Core i7
>> Ivy Bridge processor with 16GB 1600 MHz DDR3 RAM running OS X v10.10. I
>> modified the compiler flags in
>> ${PYFR_ROOT}/pyfr/backends/openmp/compiler.py, replacing the
>> -march=native option with -mtune=native (-march=native emits code for the
>> build machine’s full instruction set, whereas -mtune=native only adjusts
>> instruction scheduling without enabling additional instructions). This
>> change isn’t necessary for the clang-omp compiler, but was necessary for
>> the gcc-4.9 compiler I tried. To keep things consistent, I left that
>> option changed across compilers. I took the fastest run in my test matrix
>> (i.e. the very last case) and re-ran it with -mtune=native reverted back
>> to -march=native and observed no change in runtime.
>> 
>> I ran the couette_flow_2d example case using a single partition
>> initiating pyfr-sim as follows:
>> 
>> pyfr-sim -p -n 100 -b openmp run couette_flow_2d.pyfrm couette_flow_2d.ini
>> 
> 
> As Peter noted, the Couette flow case isn't great for benchmarking
> purposes.  It has relatively few elements, all of which are 2D.  As a
> consequence, overheads (from Python, starting/joining threads, BLAS) are
> all very high relative to the runtime.  Further, on many systems the
> entire problem is able to fit within the last level cache of the CPU.
> This can distort the numbers somewhat as kernels which are usually
> memory bandwidth bound suddenly become FLOP bound.
> 
> As a starting point, ~2500 third-order hexahedral elements are usually
> sufficient to amortise away any overheads, and this is reasonably
> realistic in terms of the loading per CPU/GPU for a real-world simulation.
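> (For a rough sense of scale, assuming the usual (p + 1)^3 solution points
> per hexahedron in flux reconstruction, third order gives 4^3 = 64 points
> per element, so ~2500 elements works out to roughly 160,000 solution
> points per CPU/GPU.)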
> 
>> 
>> The results for the openmp backend tests follow:
>> 
>> [snip]
>> 
>> The third case run shows that hyperthreading is a no-no, as I’m sure
>> you’re already aware. I was actually surprised that Apple’s Accelerate
>> Framework was less performant than OpenBLAS, and I’ve convinced myself
>> that gcc-4.9 (v4.9.2) is garbage. To install an OpenMP-capable version of
>> clang I used Homebrew and this brew recipe
>> <https://github.com/Homebrew/homebrew/pull/33278>. I also had to compile
>> and install Intel’s OpenMP Runtime Library
>> <https://www.openmprtl.org/download#stable-releases>. I downloaded the
>> version listed at the top of the table (Version 20140926), unpacked it,
>> and invoked make with make compiler=clang. Next, I moved the resulting
>> *.dylib and *.h files to their respective lib and include directories
>> under /usr/local. Lastly, I set C_INCLUDE_PATH and CPLUS_INCLUDE_PATH
>> to include /usr/local/include, and DYLD_LIBRARY_PATH to include
>> /usr/local/lib.
> 
> An important thing to consider when switching between cblas-st and
> cblas-mt is what the underlying BLAS library is configured to do.
> Passing a multi-threaded BLAS library to cblas-st will result in an
> over-subscription of cores (the N BLAS calls will all themselves try to
> launch N threads).
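> For example, on a four-core machine, cblas-st with four OpenMP threads
> each calling into a four-thread BLAS yields sixteen runnable threads
> contending for four cores.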
> 
> With OpenBLAS it is relatively simple to disable threading when the
> library is compiled.  If compiling multi-threaded OpenBLAS there is a
> choice between using OpenMP and its own threading code.
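> (If recompiling is inconvenient, a multi-threaded OpenBLAS can also be
> restricted at runtime through the OPENBLAS_NUM_THREADS environment
> variable, e.g. OPENBLAS_NUM_THREADS=1, though building it single-threaded
> is the cleaner fix.)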
> 
> I am unsure if it is possible to get Accelerate not to multi-thread on
> its own.
> 
> With regard to the performance of GCC 4.9: it would be interesting to
> see if that result carries forward when only a single thread is used (so
> eliminating OpenMP library overheads and comparing just the quality of
> the generated code).
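> (Setting OMP_NUM_THREADS=1 in the environment before launching pyfr-sim
> is the standard OpenMP way to pin the run to a single thread.)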
> 
>> Now something has recently changed with either pycuda under OS X or
>> PyFR, because initiating a similar test using the cuda backend results
>> in the following traceback: 
>> 
>> pyfr-sim -p -n 100 -b cuda run couette_flow_2d.pyfrm couette_flow_2d.ini
>> 
>> Traceback (most recent call last):
>>   File "/Users/zdavis/Applications/PyFR/pyfr/scripts/pyfr-sim", line 112, in <module>
>>     main()
>>   File "/usr/local/lib/python2.7/site-packages/mpmath/ctx_mp.py", line 1301, in g
>>     return f(*args, **kwargs)
>>   File "/Users/zdavis/Applications/PyFR/pyfr/scripts/pyfr-sim", line 82, in main
>>     backend = get_backend(args.backend, cfg)
>>   File "/Users/zdavis/Applications/PyFR/pyfr/backends/__init__.py", line 11, in get_backend
>>     return subclass_where(BaseBackend, name=name.lower())(cfg)
>>   File "/Users/zdavis/Applications/PyFR/pyfr/backends/cuda/base.py", line 33, in __init__
>>     from pycuda.autoinit import context
>>   File "/usr/local/lib/python2.7/site-packages/pycuda/autoinit.py", line 4, in <module>
>>     cuda.init()
>> pycuda._driver.RuntimeError: cuInit failed: no device
>> 
>> I remember that when I first installed and ran PyFR (~v0.2), this worked
>> just fine using the default backend.  I’m curious what has changed.
> 
> At a Python prompt try:
> 
>  import pycuda.autoinit
> 
> and let us know what the outcome is.  I believe some people are having
> issues with CUDA/PyCUDA on Mac OS 10.10.
> 
> Regards, Freddie.
