On Tue, May 01, 2018 at 12:48:17AM +0000, Germán Lancioni wrote:
> Hi mlpack team,
>
> Congratulations on the fantastic work you are doing. After trying several
> options, I find mlpack the most professional, well-written, and maintained
> framework for ML. I have been using Naive Bayes, Random Forest, k-fold CV,
> and model saving. Now I'm wondering if mlpack somehow supports one-class
> classification (e.g., one-class SVM), as I have an anomaly detection problem
> at hand. I tried going through the API docs but couldn't find anything in
> that regard.
>
> I appreciate any input, and again cheers for the outstanding work.
Hi Germán,

Thanks for the nice words about mlpack. I'm glad that you've found it useful.

At the moment, we don't have any out-of-the-box one-class classification
techniques implemented. However, at its core, anomaly detection can be
expressed as the following question: assuming my data came from some
distribution D, how likely is it that a given point also came from D? More
specifically, if we have some probability density estimate p(x | D) for a
point x, we can then do something like saying "if p(x | D) < threshold, then
x is an anomaly".

Building on that idea, there are a few things in mlpack you may be able to
use:

 * You could use density estimation trees, if your data is low-dimensional,
   to compute the density of each point. That code is found in
   src/mlpack/methods/det/.

 * Kernel density estimation (KDE) is being implemented now in
   https://github.com/mlpack/mlpack/pull/1301, and you could use that to do
   much the same thing. I think it should work in its current state, but it
   may be easier to wait until it is done and merged.

 * You could build a feedforward autoencoder and use, e.g., the MSE of the
   reconstruction as a measure of anomaly. Here's a similar example:
   https://shiring.github.io/machine_learning/2017/05/01/fraud

 * You could use k-furthest-neighbors to compute the mean kFN distance for a
   point; that *could* serve as a measure of outlier-ness.

 * This is a little different, but you could use DBSCAN with a properly
   tuned radius, and treat the points that get classified as "noise" (i.e.
   those far away from any cluster) as anomalies.

 * The last option (probably the most time-consuming) would be to implement
   a one-class technique yourself; we could then merge it into mlpack, so
   long as it's well tested and fast. :)

Maybe some of these ideas work for your situation, maybe not. In any case, I
hope they are helpful, and let me know if I can clarify anything.

--
Ryan Curtin    | "This is how Number One works!"
[email protected] | - Number One _______________________________________________ mlpack mailing list [email protected] http://knife.lugatgt.org/cgi-bin/mailman/listinfo/mlpack
