Greetings and Salutations,

I would suggest the following modifications:

1. Use the Rcpp OpenMP plugin:

// [[Rcpp::plugins(openmp)]]

instead of setting the compiler flags yourself (assuming you are on Rcpp >= 0.10.5).

2. Modify the function parameters to include: int cores

This lets you specify the number of cores at run time rather than at compile time.

3. Specify the pragma directive as:

#pragma omp parallel for num_threads(cores)

Or use:

omp_set_num_threads(cores);

The first fails more gracefully if the system does not support OpenMP, and it 
overrides any thread count set elsewhere. (A sketch combining all three 
changes follows below.)
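
Putting the three changes together, a rough sketch based on a stripped-down 
version of your loop (I have dropped fact_eye, YTY, and lambda_eye for 
brevity, and the function name solve_rows is made up):

// [[Rcpp::plugins(openmp)]]
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// [[Rcpp::export]]
arma::mat solve_rows(const arma::mat& C, const arma::mat& Y,
                     const arma::mat& P, int cores = 1) {
  arma::mat X(C.n_rows, Y.n_cols);

  // If OpenMP is not available, the pragma is simply ignored and the
  // loop runs serially -- the graceful failure mentioned above.
  #pragma omp parallel for num_threads(cores)
  for (int u = 0; u < (int)C.n_rows; u++) {
    // Declared inside the loop, hence private to each thread.
    arma::mat Cu  = arma::diagmat(C.row(u));
    arma::mat WuT = Y.t() * Cu * Y;
    arma::vec xu  = arma::solve(WuT, Y.t() * Cu * P.row(u).t());
    X.row(u) = xu.t();  // each u writes a distinct row, so this is safe
  }
  return X;
}

From R you would then call solve_rows(C, Y, P, cores = 4) and tune cores 
against your BLAS settings (see question 1 below).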

Regarding your questions:


1. OpenMP will open up the requested number of threads. If you also have a 
parallel BLAS, each of those threads will open up additional BLAS threads of 
its own. This is problematic. Consider a machine with 8 cores:

The OpenMP loop defaults to 4 threads.
Assume the parallel BLAS uses 2 threads per call.
Then 4 * 2 = 8 threads are allocated for parallelization.

So, depending on your allocation, you probably will have the two levels "step 
over" each other.
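
To make that arithmetic concrete, a toy illustration (the thread counts are 
made up, and the environment variables in the comments apply to OpenBLAS and 
MKL respectively):

#include <cstdio>

int main() {
  int loop_threads = 4;  // threads requested for the outer OpenMP loop
  int blas_threads = 2;  // threads a parallel BLAS spawns per call

  // Every loop thread makes BLAS calls, and each call fans out again:
  std::printf("worst case: %d active threads\n",
              loop_threads * blas_threads);  // 4 * 2 = 8

  // The usual fix is to give the cores to one level only, e.g. pin the
  // BLAS to one thread (OPENBLAS_NUM_THREADS=1 or MKL_NUM_THREADS=1)
  // and let the outer loop have all of the cores.
  return 0;
}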


2. Reductions in OpenMP are generally only possible if you have: var = var op 
expr (e.g. sum += x(i);), where:

var is a scalar accumulator (e.g. sum, the summed value)
op is the operator to apply (e.g. +, plus)
expr is a scalar expression that does not reference var (e.g. x(i), the new value)
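
For instance, a parallel sum fits that pattern exactly (a minimal sketch; the 
function name par_sum is made up):

// [[Rcpp::plugins(openmp)]]
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// [[Rcpp::export]]
double par_sum(const arma::vec& x, int cores = 1) {
  double sum = 0.0;  // var: the scalar accumulator

  #pragma omp parallel for num_threads(cores) reduction(+:sum)
  for (int i = 0; i < (int)x.n_elem; i++) {
    sum += x(i);  // var = var op expr
  }
  return sum;
}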

I'm confused as to whether you are referring to your final output, i.e. 
X.row(u) = xu.t();, as the reduction.

If that is the case: the object X is being updated in shared memory, but since 
each iteration updates only one distinct row, this is fine and no reduction is 
needed.

Everything else within the for loop is private to each thread, since it is 
declared inside the parallel region.


As you continue your journey into OpenMP, these might help:

Slides regarding OpenMP and RcppArmadillo:
http://www.thecoatlessprofessor.com/wp-content/uploads/2014/09/hpc_parallel.pdf

Demo code for using OpenMP with Armadillo & Eigen using the tapering idea in 
spatial statistics:
https://github.com/coatless/pims_bigdata

Sincerely,

JJB

From: rcpp-devel-boun...@lists.r-forge.r-project.org 
[mailto:rcpp-devel-boun...@lists.r-forge.r-project.org] On Behalf Of Saurabh B
Sent: Tuesday, May 26, 2015 4:53 PM
To: rcpp-devel@lists.r-forge.r-project.org
Subject: [Rcpp-devel] OpenMP and Parallel BLAS

Hi there,

I am using gradient descent to reduce a large matrix of users and items. For 
this I am trying to use all 40 available cores, but unfortunately my 
performance is no better than when I was using just one. I am new to OpenMP 
and RcppArmadillo, so pardon my ignorance.

The main loop is -
#pragma omp parallel for
for (int u = 0; u < C.n_rows; u++) {
  arma::mat Cu = diagmat(C.row(u));
  arma::mat YTCuIY = Y.t() * (Cu) * Y;
  arma::mat YTCupu = Y.t() * (Cu + fact_eye) * P.row(u).t();
  arma::mat WuT = YTY + YTCuIY + lambda_eye;
  arma::mat xu = solve(WuT, YTCupu);

  // Update gradient -- maybe a slow operation in parallel?
  X.row(u) = xu.t();
}

full code - 
https://github.com/sanealytics/recommenderlabrats/blob/master/src/implicit.cpp

(implementing this paper - 
http://www.researchgate.net/profile/Yifan_Hu/publication/220765111_Collaborative_Filtering_for_Implicit_Feedback_Datasets/links/0912f509c579ddd954000000.pdf)

Matrices C, Y and P are large. Matrix X can be assumed to be small.

I have the following questions -
1) I have replaced my BLAS with an OpenMP (parallel) BLAS and am also using 
the "#pragma omp parallel for" clause. Will they step over each other, or are 
they complementary? I ask because my understanding is that the for loop will 
split the users across threads, and then the BLAS will redistribute each 
matrix multiplication across all the threads again. Is that right? And if so, 
is that what we want to do?

2) Since the threads are running in parallel and I just need the resulting 
value as output, I would ideally like a reduce() that gives me each row in 
sequence so I can construct the new X from it. I am not sure how to go about 
doing that with Rcpp. I also want to avoid copying data as much as possible.

Looking forward to your input,
Saurabh

