hi,

Nicolas, could you give some numbers on the impact of these different papers,
to get an idea of which method would be of most interest to the
sklearn community? do they all scale to medium or large datasets?

is there anybody on the list with experience with these tools?

Best,
Alex


On Fri, Jun 13, 2014 at 3:42 PM, Nicolas Goix <goix.nico...@gmail.com> wrote:
> Hello,
>
> This is my first post to the list. I have recently been in touch with
> Alexandre Gramfort, and I would be very interested in exploring some
> outlier/anomaly detection algorithms before eventually putting them behind a
> scikit-learn compatible API (with a view to eventually merging them).
>
> I'm not particularly aware of the state of the art regarding the efficiency of
> such algorithms. I have just read some surveys and other literature on the
> topic, and my conclusion is that exploring the following classical methods
> would be productive:
>
>
> - density-based algorithms: LOF (Local Outlier Factor) and its variations
> (other algorithms using relative density/k-NN) such as COF
> (Connectivity-based Outlier Factor), ODIN (Outlier Detection using Indegree
> Number), LOCI (Local Correlation Integral).
>
> LOF : http://www.dbs.ifi.lmu.de/Publikationen/Papers/LOF.pdf
>
> COF : http://www.cse.cuhk.edu.hk/~adafu/Pub/pakdd02.pdf
>
> ODIN : ftp://193.167.42.127/pub/franti/papers/Hautamaki/P2.pdf
>
> LOCI : http://www.dtic.mil/dtic/tr/fulltext/u2/a461085.pdf
>
>
> - high-dimensional approach: the "Aggarwal and Yu algorithm"
>
> http://www.researchgate.net/publication/2401320_Outlier_Detection_for_High_Dimensional_Data/file/e0b49525c3e5f60b5e.pdf
>
>
> - iForest (Isolation Forest), which seems very interesting because it does
> not rely on any distance or density measure.
>
> http://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf?q=isolation
>
>
> So please let me know if any of these algorithms (or others) are of
> particular interest.
>
> Anyway, I'd be very glad to get any feedback on this.
>
>
> Cheers,
>
> Nicolas
>
>
> ------------------------------------------------------------------------------
> HPCC Systems Open Source Big Data Platform from LexisNexis Risk Solutions
> Find What Matters Most in Your Big Data with HPCC Systems
> Open Source. Fast. Scalable. Simple. Ideal for Dirty Data.
> Leverages Graph Analysis for Fast Processing & Easy Data Exploration
> http://p.sf.net/sfu/hpccsystems
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
