Excellent work, thanks very much.

Dale Smith, Ph.D.
Senior Financial Quantitative Analyst
Financial & Risk Management Solutions
Fiserv
Office: 678-375-5315
www.fiserv.com<http://www.fiserv.com/>

From: rcpp-devel-boun...@r-forge.wu-wien.ac.at 
[mailto:rcpp-devel-boun...@r-forge.wu-wien.ac.at] On Behalf Of Qiang Kou
Sent: Thursday, July 24, 2014 8:25 PM
To: rcpp-devel@lists.r-forge.r-project.org
Subject: [Rcpp-devel] New package: RcppMLPACK, integration with MLPACK using 
Rcpp

RcppMLPACK is almost done, and I really hope it is useful for other people. 
Testing and bug reports are very welcome, not only for the code but also for 
the results. You can try it now from my repo: 
https://github.com/thirdwing/RcppMLPACK

I am afraid there may be problems on Windows related to the size_t type.

MLPACK is a scalable C++ machine learning library providing an intuitive and 
simple API. It implements a wide array of machine learning methods and uses 
Armadillo as input/output. For more detail about MLPACK, please visit its 
homepage: http://www.mlpack.org/

Since we have Rcpp and RcppArmadillo, which can integrate C++ and Armadillo 
with R seamlessly, RcppMLPACK becomes something very natural. The RcppMLPACK 
package includes the source code from the MLPACK library. Thus users do not 
need to install MLPACK itself in order to use RcppMLPACK.

I use k-means as an example. With RcppMLPACK, a k-means method can be 
implemented as shown below. The interface between R and C++ is handled by Rcpp 
and RcppArmadillo.

#include "RcppMLPACK.h"

using namespace mlpack::kmeans;
using namespace Rcpp;

// [[Rcpp::export]]
List kmeans(const arma::mat& data, const int& clusters) {

    arma::Col<size_t> assignments;

    // Initialize with the default arguments.
    KMeans<> k;

    k.Cluster(data, clusters, assignments);

    return List::create(_["clusters"] = clusters,
                        _["result"]   = assignments);
}
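
To make clear what the assignments vector holds, here is a minimal sketch of 
the assignment step of k-means in plain C++: each observation gets the index of 
its nearest centroid. This is only an illustration, not MLPACK's 
implementation; the function name assign_to_nearest and the flat 
std::vector layout (one observation per column, column-major) are assumptions 
made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper, not part of MLPACK or RcppMLPACK.
// data:      n_dims x n_obs, column-major (one observation per column).
// centroids: n_dims x n_clusters, column-major.
// Returns, for each observation, the index of the nearest centroid
// (squared Euclidean distance).
std::vector<std::size_t> assign_to_nearest(const std::vector<double>& data,
                                           const std::vector<double>& centroids,
                                           std::size_t n_dims) {
    assert(n_dims > 0);
    assert(data.size() % n_dims == 0 && centroids.size() % n_dims == 0);

    const std::size_t n_obs = data.size() / n_dims;
    const std::size_t n_clusters = centroids.size() / n_dims;
    std::vector<std::size_t> assignments(n_obs, 0);

    for (std::size_t i = 0; i < n_obs; ++i) {
        double best = std::numeric_limits<double>::infinity();
        for (std::size_t c = 0; c < n_clusters; ++c) {
            double dist2 = 0.0;
            for (std::size_t d = 0; d < n_dims; ++d) {
                const double diff = data[i * n_dims + d] - centroids[c * n_dims + d];
                dist2 += diff * diff;
            }
            if (dist2 < best) {
                best = dist2;
                assignments[i] = c;
            }
        }
    }
    return assignments;
}
```

A full k-means run alternates this step with recomputing each centroid as the 
mean of its assigned observations until the assignments stop changing.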

The inline package provides a complete wrapper around the compilation, linking, 
and loading steps, so all of them can be done within an R session. RcppMLPACK 
supports inline compilation as well.

library(inline)
library(RcppMLPACK)
code <- '
  arma::mat data = as<arma::mat>(test);
  int clusters = as<int>(n);
  arma::Col<size_t> assignments;
  mlpack::kmeans::KMeans<> k;
  k.Cluster(data, clusters, assignments);
  return List::create(_["clusters"] = clusters,
                      _["result"]   = assignments);
'
mlKmeans <- cxxfunction(signature(test="numeric", n ="integer"), code, 
plugin="RcppMLPACK")
data(trees, package="datasets")
mlKmeans(t(trees), 3)

There is one point we need to pay attention to: Armadillo matrices are stored 
in column-major format, and for speed MLPACK expects observations to be stored 
as columns and dimensions as rows. R data sets usually have one observation per 
row, so an additional transpose may be needed when using MLPACK (hence the t() 
call above).
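
As a rough illustration of what that transpose does in memory, here is a small 
sketch in plain C++ without Armadillo. The helper names col_major_index and 
to_observations_as_columns are made up for this example; in real code, t() in R 
or arma::trans() would do the same job.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Offset of element (row, col) in a column-major matrix with n_rows rows.
inline std::size_t col_major_index(std::size_t row, std::size_t col,
                                   std::size_t n_rows) {
    return col * n_rows + row;
}

// Hypothetical helper: take an R-style matrix (n_obs x n_dims, one
// observation per row, stored column-major as R does) and return the
// layout MLPACK expects (n_dims x n_obs, one observation per column,
// also column-major).
std::vector<double> to_observations_as_columns(const std::vector<double>& r_data,
                                               std::size_t n_obs,
                                               std::size_t n_dims) {
    assert(r_data.size() == n_obs * n_dims);
    std::vector<double> out(r_data.size());
    for (std::size_t obs = 0; obs < n_obs; ++obs) {
        for (std::size_t dim = 0; dim < n_dims; ++dim) {
            out[col_major_index(dim, obs, n_dims)] =
                r_data[col_major_index(obs, dim, n_obs)];
        }
    }
    return out;
}
```

After the conversion, all values of one observation are contiguous in memory, 
which is the access pattern a per-observation distance computation benefits 
from.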

The package also contains an RcppMLPACK.package.skeleton() function for people 
who want to use MLPACK code in their own package. It follows the structure of 
RcppArmadillo.package.skeleton().

library(RcppMLPACK)
RcppMLPACK.package.skeleton("foobar")
Creating directories ...
Creating DESCRIPTION ...
Creating NAMESPACE ...
Creating Read-and-delete-me ...
Saving functions and data ...
Making help files ...
Done.
Further steps are described in './foobar/Read-and-delete-me'.

Adding RcppMLPACK settings
 >> added Imports: Rcpp
 >> added LinkingTo: Rcpp, RcppArmadillo, BH, RcppMLPACK
 >> added useDynLib and importFrom directives to NAMESPACE
 >> added Makevars file with RcppMLPACK settings
 >> added Makevars.win file with RcppMLPACK settings
 >> added example src file using MLPACK classes
 >> invoked Rcpp::compileAttributes to create wrappers

system("ls -R foobar")
foobar:
DESCRIPTION  man  NAMESPACE  R  Read-and-delete-me  src

foobar/man:
foobar-package.Rd

foobar/R:
RcppExports.R

foobar/src:
kmeans.cpp  Makevars  Makevars.win  RcppExports.cpp

Even without performance testing, we would expect the C++ implementation to be 
faster. To check this, a small wine data set from the UCI data set repository 
is used for benchmarking, with a script based on the rbenchmark package:

suppressMessages(library(rbenchmark))
res <- benchmark(mlKmeans(t(wine),3),
                 kmeans(wine,3),
                 columns=c("test", "replications", "elapsed",
                 "relative", "user.self", "sys.self"), order="relative")

For 100 replications, the MLPACK version of k-means (0.028s) is about 33 times 
faster than kmeans in R (0.947s). However, we should note that R returns more 
information than just the clustering result, and the R function performs many 
more checks.

There is an important problem in MLPACK: it uses the size_t type heavily.

Wrapping this type can cause problems, since on 64-bit Windows size_t is 
defined as unsigned long long int. No such error was found during testing on 
my Ubuntu machine.
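
For context, on 64-bit Linux size_t is unsigned long, while on 64-bit Windows 
unsigned long is only 32 bits wide and size_t is unsigned long long, so a 
wrapper written for one may not match the other. One possible workaround, 
sketched below in plain C++, is to convert the size_t values to int (the width 
of an R integer) with a range check before handing them back to R. This is 
only an assumption about how one might handle it, not what RcppMLPACK does; 
the name assignments_to_int is made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <stdexcept>
#include <vector>

// Hypothetical helper: convert size_t cluster assignments to int so the
// result no longer depends on how the platform defines size_t.
// Throws std::range_error if a value does not fit in int.
std::vector<int> assignments_to_int(const std::vector<std::size_t>& assignments) {
    const std::size_t max_int =
        static_cast<std::size_t>(std::numeric_limits<int>::max());

    std::vector<int> out;
    out.reserve(assignments.size());
    for (std::size_t a : assignments) {
        if (a > max_int) {
            throw std::range_error("assignment does not fit in int");
        }
        out.push_back(static_cast<int>(a));
    }
    assert(out.size() == assignments.size());
    return out;
}
```

Cluster indices are small in practice, so the range check should never fire 
for k-means output; it is there to make the narrowing conversion explicit.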

Again, testing and bug reports are very welcome, for both the code and the 
results.

Best,

KK
--
Qiang Kou
q...@umail.iu.edu<mailto:q...@umail.iu.edu>
School of Informatics and Computing, Indiana University

_______________________________________________
Rcpp-devel mailing list
Rcpp-devel@lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/rcpp-devel
