On 26 May 2015 at 17:53, Saurabh B wrote:
| Hi there,
| 
| I am using gradient descent to reduce a large matrix of users and items. For
| this I am trying to use all 40 available cores but unfortunately my performance
| is no better than when I was using just one. I am new to OpenMP and
| RcppArmadillo so pardon my ignorance.
| 
| The main loop is -
| 
| #pragma omp parallel for
| for (int u = 0; u < C.n_rows; u++) {
|     arma::mat Cu = diagmat(C.row(u));
|     arma::mat YTCuIY = Y.t() * (Cu) * Y;
|     arma::mat YTCupu = Y.t() * (Cu + fact_eye) * P.row(u).t();
|     arma::mat WuT = YTY + YTCuIY + lambda_eye;
|     arma::mat xu = solve(WuT, YTCupu);
|     // Update gradient -- maybe a slow operation in parallel?
|     X.row(u) = xu.t();
| }
| 
| 
| 
| full code - https://github.com/sanealytics/recommenderlabrats/blob/master/src/implicit.cpp
| 
| (implementing this paper - http://www.researchgate.net/profile/Yifan_Hu/publication/220765111_Collaborative_Filtering_for_Implicit_Feedback_Datasets/links/0912f509c579ddd954000000.pdf)
| 
| Matrices C, Y and P are large. Matrix X can be assumed to be small.
| 
| I have the following questions -
| 1) I have replaced my BLAS with OpenMP BLAS and am also using the "#pragma omp
| parallel for" directive. Will they step over each other or are they complementary?

They will step over each other.

IIRC, e.g. the RhpcBLASctl package allows you to control the number of threads.
It has been a while since I worked with OpenMP, but when I did, I think I
"downgraded" to Atlas to avoid the interference.

| I ask because my understanding is that the for loop will split each user across
| threads, then the BLAS will redistribute the matrices to multiply across all
| threads again. Is that right? And if so, is that what we want to do?

Probably not.
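
To make the "pick one level of parallelism" idea concrete, here is a minimal
sketch in plain C++ (no Armadillo, so it compiles on its own). The names
process_users and per_user_work are hypothetical stand-ins and not taken from
implicit.cpp: per_user_work plays the role of the per-user solve and is
assumed to run single-threaded. If the BLAS in use is OpenBLAS (an assumption
on my part), setting the environment variable OPENBLAS_NUM_THREADS=1 before
starting R is one way to keep it from spawning its own threads inside the loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the per-user work in the loop body. In the real
// code this would be the Armadillo/BLAS calls, assumed here to be serial.
double per_user_work(const std::vector<double>& row) {
    double acc = 0.0;
    for (double v : row) acc += v * v;   // sum of squares as a placeholder
    return acc;
}

// Parallelise over users only: the outer loop owns all the threads and the
// inner kernel stays serial, so the two layers do not compete for cores.
std::vector<double> process_users(const std::vector<std::vector<double>>& C,
                                  int n_threads) {
    assert(n_threads > 0);
    std::vector<double> out(C.size());
    const long n = static_cast<long>(C.size());

    #pragma omp parallel for num_threads(n_threads) schedule(static)
    for (long u = 0; u < n; ++u) {
        out[u] = per_user_work(C[u]);    // each iteration writes its own slot
    }
    return out;
}
```

Compile with -fopenmp to get the threaded loop. Without that flag the pragma
is ignored and the loop runs serially with the same result.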

| 2) Since the threads are running in parallel and I just need the resulting
| value as output, I would ideally like a reduce() that gives each row in
| sequence and I can construct the new X from it. I am not sure how to go about
| doing that with Rcpp. I also want to avoid copying data as much as possible.

It is an interesting problem. Sorry I can't be of more help.
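
One possible direction, offered as a sketch only: if every iteration produces
exactly one row of X, then no reduce() step may be needed, because the
iterations write to disjoint parts of a matrix that was allocated once before
the loop. The example below uses a flat row-major std::vector instead of
arma::mat so that it stands alone; build_X and solve_row are hypothetical
names, and solve_row just fills in placeholder values where the real solve
would go. Note that Armadillo stores matrices column-major, so in the real
code it may be worth holding the transpose so that each user is a contiguous
column.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the per-user solve: writes k values for user u
// directly into the caller-provided row, so no temporary is copied back.
void solve_row(std::size_t u, std::size_t k, double* row_out) {
    assert(row_out != nullptr);
    for (std::size_t j = 0; j < k; ++j)
        row_out[j] = static_cast<double>(u * k + j);  // placeholder values
}

// X is allocated once (row-major, n_users x k). Each iteration owns row u,
// so the rows end up in user order no matter which thread ran which u.
std::vector<double> build_X(std::size_t n_users, std::size_t k) {
    std::vector<double> X(n_users * k);
    const long n = static_cast<long>(n_users);

    #pragma omp parallel for schedule(static)
    for (long u = 0; u < n; ++u) {
        const std::size_t row = static_cast<std::size_t>(u);
        solve_row(row, k, X.data() + row * k);  // disjoint slice per user
    }
    return X;
}
```

Because no two iterations touch the same elements, no locking or critical
section is required for the writes themselves.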

Dirk

| Looking forward to your input,
| Saurabh
| 
| 
| _______________________________________________
| Rcpp-devel mailing list
| Rcpp-devel@lists.r-forge.r-project.org
| https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/rcpp-devel

-- 
http://dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org