Thank you both. I'm trying out these excellent suggestions and reading through the material. It really helps my understanding of how these two things work together. I will update with my findings for everyone's benefit.
To clarify my second question, James: can multiple threads concurrently modify different rows of an arma::mat object, or does the entire matrix X have to be locked in order to modify a section of it? If it is the latter, one way around that is to borrow the functional paradigm users -> map -> myFunction() -> reduce(), where myFunction() outputs the new X(u) for the given user and reduce() then combines all such values into the new X (the old one is garbage collected). Since a new matrix is emitted each time, lock contention is not an issue. But if an arma::mat can be modified concurrently, none of that is needed; I think that is what you allude to above (see the sketch after the quoted thread below).

Thanks again,
Saurabh

On Tue, May 26, 2015 at 8:41 PM, Balamuta, James Joseph <balam...@illinois.edu> wrote:

> Greetings and Salutations,
>
> I would suggest the following modifications:
>
> 1. Use the Rcpp OpenMP plugin
>
>        // [[Rcpp::plugins(openmp)]]
>
>    instead of setting the compiler flags yourself (assuming you are on Rcpp >= 0.10.5).
>
> 2. Modify the function parameters to include: int cores
>    This allows you to specify the number of cores at run time rather than at compile time.
>
> 3. Specify the pragma directive as:
>
>        #pragma omp parallel for num_threads(cores)
>
>    or use:
>
>        omp_set_num_threads(cores);
>
>    The first fails more gracefully if the system does not support OpenMP, and it overrides any core count set elsewhere.
>
> Regarding your questions:
>
> 1. OpenMP will open up the requested number of threads. If you also have a parallel BLAS, it will open up even more threads, which is problematic. Consider a machine with 8 cores: if the OpenMP loop defaults to 4 threads and the parallel BLAS uses 2 threads per call, then 4 * 2 = 8 cores are allocated for parallelization. So, depending on your allocation, the two levels of parallelism probably will "step over" each other.
>
> 2. Reductions in OpenMP are generally only possible if you have:
>
>        var = var op expr    (e.g. sum += x(i);)
>
>    where var is a scalar (e.g. sum, the accumulated value), op is the operator to apply (e.g. +), and expr is a scalar expression that does not reference var (e.g. x(i), the new value).
>
>    I'm confused as to whether you are referring to your final output, e.g. X.row(u) = xu.t();, as the reduction. If that is the case, the object X is being updated in shared memory; since each iteration updates only one row, this is fine. Everything else within the for loop is private to the iteration, since it is declared inside the pragma's scope.
>
> With your journey into OpenMP, these might help:
>
> Slides regarding OpenMP and RcppArmadillo:
> http://www.thecoatlessprofessor.com/wp-content/uploads/2014/09/hpc_parallel.pdf
>
> Demo code for using OpenMP with Armadillo & Eigen, applying the tapering idea in spatial statistics:
> https://github.com/coatless/pims_bigdata
>
> Sincerely,
>
> JJB
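Putting suggestions 1-3 and the var = var op expr reduction form together, here is a minimal sketch, assuming Rcpp >= 0.10.5 and RcppArmadillo; the function name par_col_sums_demo and its contents are hypothetical illustrations, not code from this thread:

    // Sketch only: illustrates the plugin, the run-time `cores` argument,
    // num_threads(cores), and a scalar reduction. Names are hypothetical.
    // [[Rcpp::depends(RcppArmadillo)]]
    // [[Rcpp::plugins(openmp)]]
    #include <RcppArmadillo.h>

    // [[Rcpp::export]]
    double par_col_sums_demo(const arma::mat& A, int cores = 1) {
      double total = 0.0;

      // num_threads(cores) requests `cores` threads for this loop only; if the
      // compiler has no OpenMP support the pragma is ignored and the loop runs
      // serially. reduction(+:total) is the var = var op expr pattern above.
      #pragma omp parallel for num_threads(cores) reduction(+:total)
      for (int j = 0; j < (int) A.n_cols; j++) {
        total += arma::accu(A.col(j));   // expr (a column sum) does not read `total`
      }

      return total;
    }

Compiled with Rcpp::sourceCpp(), this can be called as par_col_sums_demo(A, cores = 8) and should match sum(A) up to floating-point reordering.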
> *From:* rcpp-devel-boun...@lists.r-forge.r-project.org [mailto:rcpp-devel-boun...@lists.r-forge.r-project.org] *On Behalf Of* Saurabh B
> *Sent:* Tuesday, May 26, 2015 4:53 PM
> *To:* rcpp-devel@lists.r-forge.r-project.org
> *Subject:* [Rcpp-devel] OpenMP and Parallel BLAS
>
> Hi there,
>
> I am using gradient descent to reduce a large matrix of users and items. For this I am trying to use all 40 available cores, but unfortunately my performance is no better than when I was using just one. I am new to OpenMP and RcppArmadillo, so pardon my ignorance.
>
> The main loop is:
>
>     #pragma omp parallel for
>     for (int u = 0; u < C.n_rows; u++) {
>       arma::mat Cu = diagmat(C.row(u));
>       arma::mat YTCuIY = Y.t() * (Cu) * Y;
>       arma::mat YTCupu = Y.t() * (Cu + fact_eye) * P.row(u).t();
>       arma::mat WuT = YTY + YTCuIY + lambda_eye;
>       arma::mat xu = solve(WuT, YTCupu);
>
>       // Update gradient -- maybe a slow operation in parallel?
>       X.row(u) = xu.t();
>     }
>
> Full code:
> https://github.com/sanealytics/recommenderlabrats/blob/master/src/implicit.cpp
>
> (implementing this paper:
> http://www.researchgate.net/profile/Yifan_Hu/publication/220765111_Collaborative_Filtering_for_Implicit_Feedback_Datasets/links/0912f509c579ddd954000000.pdf
> )
>
> Matrices C, Y and P are large. Matrix X can be assumed to be small.
>
> I have the following questions:
>
> 1) I have replaced my BLAS with an OpenMP-parallel BLAS and am also using the "#pragma omp parallel for" clause. Will they step over each other, or are they complementary? I ask because my understanding is that the for loop will split the users across threads, and then the BLAS will redistribute the matrix multiplications across all threads again. Is that right? And if so, is that what we want to do?
>
> 2) Since the threads are running in parallel and I just need the resulting values as output, I would ideally like a reduce() that gives me each row in sequence so I can construct the new X from it. I am not sure how to go about doing that with Rcpp. I also want to avoid copying data as much as possible.
>
> Looking forward to your input,
> Saurabh
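On the row-update question at the top of this message: writes to disjoint rows of a shared arma::mat do not race with each other, so no lock and no map/reduce rebuild of X is needed; the X.row(u) = xu.t(); line in the loop above already relies on exactly this. Armadillo adds no internal locking, but none is required when every thread writes a different row. A minimal sketch, assuming RcppArmadillo; rowwise_update_demo and the placeholder arithmetic stand in for the real per-user solve in implicit.cpp:

    // Sketch only: each iteration writes a distinct row of the shared matrix X,
    // so threads never touch the same elements and no locking is required.
    // The function name and the placeholder arithmetic are hypothetical.
    // [[Rcpp::depends(RcppArmadillo)]]
    // [[Rcpp::plugins(openmp)]]
    #include <RcppArmadillo.h>

    // [[Rcpp::export]]
    arma::mat rowwise_update_demo(const arma::mat& Y, int cores = 1) {
      arma::mat X(Y.n_rows, Y.n_cols);        // shared result, allocated once up front

      #pragma omp parallel for num_threads(cores)
      for (int u = 0; u < (int) Y.n_rows; u++) {
        // xu is declared inside the loop body, so it is private to each iteration
        arma::rowvec xu = 2.0 * Y.row(u);     // placeholder for the per-user solve()
        X.row(u) = xu;                        // row u is written by exactly one thread
      }

      return X;
    }

Because the rows are independent, the result is identical to the serial loop; a reduce() step that rebuilds X from emitted pieces would only add copies.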