Hello,
The following study evaluates four outlier detection algorithms on the
DARPA 1998 data set: unsupervised SVM, the LOF approach, the NN approach
and a Mahalanobis-based approach:
http://static.msi.umn.edu/rreports/2003/72.pdf
They find that the LOF approach performs best, followed by the NN
approach, unsupervised SVM and the Mahalanobis-based approach.
However, like other distance-based approaches, the LOF algorithm does not
scale well to high-dimensional data: as the dimension grows, the data
become sparse and all the points become almost equidistant.
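
To illustrate the "almost equidistant" effect, here is a minimal sketch
(my own toy setup, not taken from any of the papers): it draws uniform
random points in the unit hypercube and measures the relative contrast
(max - min) / min of the distances to a random query point. The function
name relative_contrast and the sample sizes are arbitrary choices.

```python
import math
import random

def relative_contrast(n_points, dim, seed=0):
    """Return (max - min) / min of the distances from a random query
    point to n_points uniform random points in the unit hypercube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

# The contrast shrinks as the dimension grows: the nearest and farthest
# points end up at nearly the same distance from the query.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(500, dim), 3))
```

When the contrast gets close to zero, neighbourhood-based scores such as
LOF have little left to discriminate on.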
Aggarwal & Yu take this effect into account, and their evolutionary
algorithm scales well to high-dimensional data (its complexity is almost
linear in the dimension and linear in the number of data points).
The same holds for iForest, which does not rely on any distance.
Furthermore, the authors' empirical evaluation shows that iForest
outperforms ORCA, one-class SVM and LOF in terms of AUC and processing
time.
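
To make the isolation idea concrete, here is a rough pure-Python sketch
of how I understand it: points are split recursively on a random
attribute at a random value, and points that get isolated after few
splits receive a high score. This is a simplified illustration, not the
authors' implementation; the function names, the toy data and the
defaults (100 trees, subsamples of at most 256 points) are my own
choices for the example.

```python
import math
import random

def c_factor(n):
    """Average path length of an unsuccessful BST search among n points,
    used to normalise depths."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649  # approximates H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def build_tree(data, rng, depth, max_depth):
    """Recursively split data on a random attribute at a random value."""
    if depth >= max_depth or len(data) <= 1:
        return ("leaf", len(data))
    dim = rng.randrange(len(data[0]))
    lo = min(p[dim] for p in data)
    hi = max(p[dim] for p in data)
    if lo == hi:
        return ("leaf", len(data))
    split = rng.uniform(lo, hi)
    left = [p for p in data if p[dim] < split]
    right = [p for p in data if p[dim] >= split]
    return ("node", dim, split,
            build_tree(left, rng, depth + 1, max_depth),
            build_tree(right, rng, depth + 1, max_depth))

def path_length(x, tree, depth=0):
    """Depth at which x reaches a leaf, plus a correction for leaf size."""
    if tree[0] == "leaf":
        return depth + c_factor(tree[1])
    _, dim, split, left, right = tree
    child = left if x[dim] < split else right
    return path_length(x, child, depth + 1)

def anomaly_scores(data, n_trees=100, sample_size=256, seed=0):
    """Score in (0, 1]; higher means easier to isolate (more anomalous)."""
    rng = random.Random(seed)
    sample_size = min(sample_size, len(data))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [build_tree(rng.sample(data, sample_size), rng, 0, max_depth)
             for _ in range(n_trees)]
    norm = c_factor(sample_size)
    scores = []
    for x in data:
        mean_depth = sum(path_length(x, t) for t in trees) / n_trees
        scores.append(2.0 ** (-mean_depth / norm))
    return scores

# Toy data: 200 Gaussian inliers plus one obvious outlier at (8, 8).
data_rng = random.Random(42)
data = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
data.append((8.0, 8.0))
scores = anomaly_scores(data)
print("most anomalous index:", scores.index(max(scores)))
```

No distance is ever computed: each tree only compares one coordinate
against a threshold, and the depth of each tree is capped at about log2
of the subsample size, which is where the low processing cost comes
from.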
Regards,
Nicolas
2014-06-20 13:37 GMT+02:00 Alexandre Gramfort <alexandre.gramf...@telecom-paristech.fr>:
> hi,
>
> Nicolas, could you give some numbers on the impact of these different works
> to get an idea of which work might have the highest interest for the
> sklearn community? do they all scale to medium or large datasets?
>
> is there anybody on the list with experience with these tools?
>
> Best,
> Alex
>
>
> On Fri, Jun 13, 2014 at 3:42 PM, Nicolas Goix <goix.nico...@gmail.com>
> wrote:
> > Hello,
> >
> > This is my first post to the list. I have recently been in touch with
> > Alexandre Gramfort, and I would be very interested in exploring some
> > outlier/anomaly detection algorithms before eventually wrapping them in
> > a scikit-learn compatible API (with a view to eventually merging them).
> >
> > I'm not particularly familiar with the state of the art regarding the
> > efficiency of such algorithms; I have just read some surveys and other
> > literature on the topic, and my conclusion is that exploring the
> > following classical methods would be productive:
> >
> >
> > - density-based algorithms: LOF (Local Outlier Factor) and its
> > variations (other algorithms using relative density/k-NN) such as COF
> > (Connectivity-based Outlier Factor), ODIN (Outlier Detection using
> > Indegree Number), LOCI (Local Correlation Integral).
> >
> > LOF : http://www.dbs.ifi.lmu.de/Publikationen/Papers/LOF.pdf
> >
> > COF : http://www.cse.cuhk.edu.hk/~adafu/Pub/pakdd02.pdf
> >
> > ODIN : ftp://193.167.42.127/pub/franti/papers/Hautamaki/P2.pdf
> >
> > LOCI : http://www.dtic.mil/dtic/tr/fulltext/u2/a461085.pdf
> >
> >
> > - high-dimensional approach: « Aggarwal and Yu algorithm »
> >
> >
> > http://www.researchgate.net/publication/2401320_Outlier_Detection_for_High_Dimensional_Data/file/e0b49525c3e5f60b5e.pdf
> >
> >
> > - iForest (Isolation Forest), which seems very interesting because it
> > does not rely on any distance or density measure.
> >
> >
> > http://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf?q=isolation
> >
> >
> > So please let me know if some of these algorithms (or others) are of
> > particular interest.
> >
> > Anyway I'd be very glad to get any feedback on it.
> >
> >
> > Cheers,
> >
> > Nicolas
> >
> >
> >
> ------------------------------------------------------------------------------
> > HPCC Systems Open Source Big Data Platform from LexisNexis Risk Solutions
> > Find What Matters Most in Your Big Data with HPCC Systems
> > Open Source. Fast. Scalable. Simple. Ideal for Dirty Data.
> > Leverages Graph Analysis for Fast Processing & Easy Data Exploration
> > http://p.sf.net/sfu/hpccsystems
> > _______________________________________________
> > Scikit-learn-general mailing list
> > Scikit-learn-general@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
> >