Darren,

this library looks interesting! Thank you for the link! 

The user-friendly provision of tools that make developing high-performance 
code easier seems to be a new trend: lately Dirk mentioned yeppp! to me (I 
am always interested in such things), OpenMP 4.0 goes in the same direction, 
and Intel provides special #pragma clauses and so-called intrinsics to 
enforce vectorization without having to dive into assembly code (see 
http://software.intel.com/en-us/articles/intel-intrinsics-guide). 
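
Just to make the intrinsics idea concrete, here is a minimal sketch (my own 
example, not taken from the Intel guide): adding two float arrays with AVX 
intrinsics instead of hoping the auto-vectorizer kicks in. It assumes an 
AVX-capable CPU and compiling with something like -mavx.

    #include <immintrin.h>

    // out[i] = a[i] + b[i], eight floats per step via 256-bit registers
    void add_avx(const float* a, const float* b, float* out, int n) {
        int i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
        }
        for (; i < n; ++i)        // scalar remainder
            out[i] = a[i] + b[i];
    }
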
Parallelization using either the CPU or an accelerator (e.g. a GPU) is not 
that new, and OpenMP (http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf) 
and OpenACC (http://www.openacc-standard.org; there is also CUDA and OpenCL, 
but you need much more code to make the same things happen) are well known 
to most of us, I assume. However, some things are not covered by these 
standards: hardware specifics (Intel intrinsics excluded). If you want to 
increase the throughput of your pipeline for a certain operation (e.g. a for 
loop), you should know your pipeline: How deep is it? How many and which 
operations are possible in one cycle of your CPU? Usually you have to look 
this up in the hardware documentation from the chip's manufacturer. 
Furthermore, there is data loading: most systems today are so-called 
non-uniform memory access (NUMA) systems, that is, the cores on one socket 
share caches and local memory. It then makes sense to keep data where it is 
needed (when code is parallelized, for example) and to pin processes to 
certain cores so that they do not have to share on-chip memory but can use 
it on their own. Methods like memory distribution and process pinning are 
used to achieve these objectives; a small sketch follows below. 
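
As an illustration of memory distribution on a NUMA system, here is a rough 
sketch (my own example) of the common "first touch" trick with OpenMP: each 
thread initializes the chunk of the array it will later work on, so the 
operating system places those pages in the memory local to that thread. The 
pinning itself is done outside the code, e.g. by setting OMP_PROC_BIND=true 
or by running the program under numactl/taskset.

    // compile with e.g.: g++ -O2 -fopenmp first_touch.cpp
    int main() {
        const long n = 100000000;      // ~800 MB, adjust to taste
        double* x = new double[n];     // deliberately left uninitialized

        // first touch: each thread writes the chunk it will use later
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; ++i)
            x[i] = 0.0;

        // the real work uses the same static schedule, hence the same
        // thread-to-data mapping established above
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; ++i)
            x[i] = 2.0 * x[i] + 1.0;

        delete[] x;
        return 0;
    }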

Parallelizing code is a topic of its own: it only makes sense to parallelize 
a loop with a lot of iterations, and even then often only with a few 
threads, as you suffer from the overhead of creating and closing the thread 
pool. Nested loops are even more problematic if you want to gain real 
efficiency. A common rule of thumb is to parallelize code that does a lot of 
work per iteration. However you decide, there remains the task of deciding 
which variables are shared and which are private among the threads working 
on your parallel region. You have to be aware of data races (several threads 
accessing the same data, with at least one of them writing). Such errors are 
often very difficult to find and produce undefined behavior (e.g. each time 
you run the loop you get a different result); see the small sketch below. 
The only tool so far with a success rate of 100% is, to my knowledge, the 
Intel Inspector (still something that normal people can afford to buy; 
TotalView is another story). 
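
To make the shared/private question and the data-race issue concrete, here 
is a small sketch (my own example) of the classic mistake and its fix with 
an OpenMP reduction:

    // compile with e.g.: g++ -O2 -fopenmp race.cpp
    #include <cstdio>

    int main() {
        const int n = 1000000;
        double sum = 0.0;

        // WRONG: all threads update 'sum' concurrently -> data race,
        // the result can change from run to run.
        //   #pragma omp parallel for
        //   for (int i = 0; i < n; ++i)
        //       sum += 1.0 / (i + 1.0);

        // CORRECT: the loop index is private per thread anyway, and
        // 'sum' is combined safely via a reduction.
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; ++i)
            sum += 1.0 / (i + 1.0);

        std::printf("sum = %f\n", sum);
        return 0;
    }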

Finally, you can decide where a parallel region should be executed, as your 
computer has not only the CPU but also, for example, the graphics card 
(GPU). A GPU usually has thousands of cores on a single chip, and each core 
can execute operations. A drawback, though, are the very limited caches and 
bandwidth (the caches are the on-chip memories, the bandwidth is the channel 
through which data is loaded). So operations with massive data loads, or 
with a lot of different data per operation, usually do not parallelize well 
on GPUs. In addition, the quick route via OpenACC is only available for 
Nvidia cards; otherwise you must fall back to CUDA or OpenCL, which means 
much more code. A minimal OpenACC sketch follows below. 
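
For completeness, a hedged sketch (my own example, not from any of the 
libraries discussed here) of what an OpenACC offload looks like; the 
function name and the compiler invocation are assumptions, e.g. the PGI 
compiler with "pgc++ -acc". The copyin/copy clauses describe the 
host-to-GPU data movement, which is exactly where the bandwidth limits 
mentioned above start to hurt.

    #include <cstddef>

    // y[i] = a * x[i] + y[i], executed on the accelerator
    void saxpy(std::size_t n, float a, const float* x, float* y) {
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (std::size_t i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }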

So, you can see that each high-performance method has its advantages and 
drawbacks, depending on your hardware and on the specific instructions in 
your code. The hardware can be checked very easily by a library, and this 
seems to be what this new Boost library does (which looks pretty 
interesting, by the way). Still, there remain areas where even the best 
parallelization is not really the best, because the library cannot know your 
specific data: How big is it? Will the data to be loaded fit into your 
cache? The Boost library concentrates on very simple functionals (as does 
yeppp!, which in addition avoids DIV and SQRT operations and relies on 
linear approximations, which are really good from a mathematical point of 
view), and this could work well; it will be interesting to test it!

Is it possible to use all these methods/tools in R with C/C++ extensions? 
Yes it is, but you have to tell the compiler! Use OpenMP and OpenACC to 
parallelize code. Vectorization is a little bit more complicated, but wait 
for OpenMP 4.0: it has a simd pragma with which you can enforce 
vectorization. It has to be said that vectorization is often done 
automatically by the compiler, but the compiler also often fails to do so, 
especially if you have parallelized code. We tested on our high-performance 
computing cluster a simple parallelized loop to compute the constant phi: 
the Intel compiler did vectorize the operations inside the parallel region, 
gcc failed (you can check vectorization via the compile flag 
-ftree-vectorizer-verbose=2). Compilers are not always as intelligent as we 
hope they are. A small Rcpp/OpenMP sketch follows below. 
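
To show how this fits together with Rcpp, here is a sketch (my own example); 
the function name is made up, and the OpenMP plugin/flags are an assumption 
about your toolchain (the compiler needs -fopenmp, and the simd part of the 
pragma needs an OpenMP-4.0-capable compiler, otherwise drop it):

    // [[Rcpp::plugins(openmp)]]
    #include <Rcpp.h>
    #include <cmath>
    using namespace Rcpp;

    // element-wise sqrt, parallelized (and, if supported, vectorized)
    // [[Rcpp::export]]
    NumericVector par_sqrt(NumericVector x) {
        int n = x.size();
        NumericVector out(n);
        #pragma omp parallel for simd
        for (int i = 0; i < n; ++i)
            out[i] = std::sqrt(x[i]);
        return out;
    }

From R you would then simply Rcpp::sourceCpp() the file and call something 
like par_sqrt(runif(1e7)).
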
In the end it remains to say: if a developer needs code appropriate to her 
needs as well as high performance, she must think about hardware and 
software. Is there something that can be scaled? What does my pipeline look 
like, and how can I structure my code to use it? How big is my data, and how 
much of it is used in each step of my code? Where is my bottleneck: 
bandwidth, caches or scalability? Testing the code and individual parts of 
it, and fine-tuning parallelization and nested parallelization with 
different numbers of threads, is the way it is done. 

To conclude: libraries and tools like the Boost Numerical Template Toolbox 
or yeppp! can help you, but certainly not entirely and not for every 
problem. Whether they are needed (i.e. whether they do a better job for 
simple functionals than self-made solutions) remains to be tested. But you 
have all the tools to tune your code yourself. In addition, Rcpp has the 
great advantage of using memory from R, and in my experience this is one of 
the main reasons why C++ extensions called from R often perform better than 
plain C++ programs not using R at all, at least in my case. Data has its 
origin somewhere and must be passed to the program, and in my opinion this 
is done very well in R/C++ via the Rcpp API (a tiny illustration follows 
below). You can of course always use the Boost library inside your extended 
C++ code and make use of its many functions. 
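
Just to illustrate the "memory from R" point (my own example, names made 
up): a NumericVector is only a thin proxy around the vector R has already 
allocated, so the C++ code reads and writes that memory directly instead of 
first copying the data into its own buffers.

    #include <Rcpp.h>
    using namespace Rcpp;

    // scales the vector in place; no copy of the data is made, the loop
    // writes straight into the memory owned by R
    // [[Rcpp::export]]
    void scale_in_place(NumericVector x, double factor) {
        for (int i = 0; i < x.size(); ++i)
            x[i] *= factor;
    }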

I hope this helps. 


Simon

On Oct 15, 2013, at 1:57 AM, Darren Cook <dar...@dcook.org> wrote:

> I was taking a look at this new C++ library, going for Boost review
> soon: (or maybe I misunderstood and only the SIMD sub-library is aiming
> to be in Boost)
>  http://nt2.metascale.org/doc/html/index.html
> 
> It looks a bit like trying to port Matlab to C++, with an emphasis on
> high-level definition of operations that are automatically parallelized.
> 
> I'd be interested to hear from the experts here if it is something that
> could usefully be made to work with Rcpp, or if it is a perfect subset of
> what can already be done with Rcpp and R.
> 
> Darren
> 
> 
> 
> -- 
> Darren Cook, Software Researcher/Developer
> 
> http://dcook.org/work/ (About me and my work)
> http://dcook.org/blogs.html (My blogs and articles)

_______________________________________________
Rcpp-devel mailing list
Rcpp-devel@lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/rcpp-devel
