Hi Andreas,
I have an implementation of the ALM method for Robust PCA (Candès et al.) 
using Jake Vanderplas' PyPROPACK. It's in a private Bitbucket repo, but I'll 
move it to GitHub and send you the link if you like. I've actually been 
wanting to contribute RPCA to sklearn.

I don't know about a PR, but a while back I came across someone who wanted 
to add RPCA, possibly as a GSoC project.

Also, there's a slight variant of Robust PCA that is apparently more scalable. 
The paper is here:
http://www.icml-2011.org/papers/41_icmlpaper.pdf

I intend to explore some of the different methods for the low rank plus sparse 
problem.

Alex (no longer a lurker, it seems)

-----Original Message-----
From: "[email protected]" 
<[email protected]>
Sent: 4/15/2015 8:24 AM
To: "[email protected]" 
<[email protected]>
Subject: Scikit-learn-general Digest, Vol 63, Issue 28

Send Scikit-learn-general mailing list submissions to
        [email protected]

To subscribe or unsubscribe via the World Wide Web, visit
        https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
or, via email, send a message with subject or body 'help' to
        [email protected]

You can reach the person managing the list at
        [email protected]

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Scikit-learn-general digest..."


Today's Topics:

   1. Re: pydata (Andreas Mueller)
   2. Robust PCA (Andreas Mueller)
   3. Re: Robust PCA (Kyle Kastner)
   4. Performance of LSHForest (Miroslav Batchkarov)
   5. Re: Contributing to scikit-learn with a re-implementation of
      a Random Forest based iterative feature selection method
      (Andreas Mueller)


----------------------------------------------------------------------

Message: 1
Date: Wed, 15 Apr 2015 10:15:56 -0400
From: Andreas Mueller <[email protected]>
Subject: Re: [Scikit-learn-general] pydata
To: [email protected]
Message-ID: <[email protected]>
Content-Type: text/plain; charset="windows-1252"

PyData London is coming up soon; I'm not sure the date is official yet, but 
it's the end of June, I think.

In NYC, I think I'm speaking at a Python meetup on April 23rd.


On 04/14/2015 06:05 PM, Pagliari, Roberto wrote:
> Is there a pydata or sklearn workshop coming up in NYC or London?
>
> Thank you,
>
>
> ------------------------------------------------------------------------------
> BPM Camp - Free Virtual Workshop May 6th at 10am PDT/1PM EDT
> Develop your own process in accordance with the BPMN 2 standard
> Learn Process modeling best practices with Bonita BPM through live exercises
> http://www.bonitasoft.com/be-part-of-it/events/bpm-camp-virtual-event?utm_source=Sourceforge_BPM_Camp_5_6_15&utm_medium=email&utm_campaign=VA_SF
>
>
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

-------------- next part --------------
An HTML attachment was scrubbed...

------------------------------

Message: 2
Date: Wed, 15 Apr 2015 10:33:59 -0400
From: Andreas Mueller <[email protected]>
Subject: [Scikit-learn-general] Robust PCA
To: [email protected]
Message-ID: <[email protected]>
Content-Type: text/plain; charset=utf-8; format=flowed

Hey all.
Was there some plan to add Robust PCA at some point? I vaguely remember 
a PR, but maybe I'm making things up.
It sounds like a pretty cool model and is widely used:
http://statweb.stanford.edu/~candes/papers/RobustPCA.pdf

[and I was just promised a good implementation]

Andy



------------------------------

Message: 3
Date: Wed, 15 Apr 2015 11:04:21 -0400
From: Kyle Kastner <[email protected]>
Subject: Re: [Scikit-learn-general] Robust PCA
To: [email protected]
Message-ID:
        <CAGNZ19C-_70uNq49_T+Rmey6=0dsh1sbrqvej2eypcepp4d...@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8

Robust PCA is awesome - I would definitely like to see a good and fast
version. I had a version once upon a time, but it was neither good
*nor* fast :)

On Wed, Apr 15, 2015 at 10:33 AM, Andreas Mueller <[email protected]> wrote:
> Hey all.
> Was there some plan to add Robust PCA at some point? I vaguely remember
> a PR, but maybe I'm making things up.
> It sounds like a pretty cool model and is widely used:
> http://statweb.stanford.edu/~candes/papers/RobustPCA.pdf
>
> [and I was just promised a good implementation]
>
> Andy
>



------------------------------

Message: 4
Date: Wed, 15 Apr 2015 16:12:26 +0100
From: Miroslav Batchkarov <[email protected]>
Subject: [Scikit-learn-general] Performance of LSHForest
To: [email protected]
Message-ID: <[email protected]>
Content-Type: text/plain; charset="utf-8"

Hi everyone,

I was really impressed by the speedups provided by LSHForest compared to 
brute-force search. Out of curiosity, I compared LSHForest to the existing ball 
tree implementation. The approximate algorithm is consistently slower (see 
below). Is this normal, and should it be mentioned in the documentation? Does 
approximate search offer any benefits in terms of memory usage?


I ran the same example 
<http://scikit-learn.org/stable/auto_examples/neighbors/plot_approximate_nearest_neighbors_scalability.html#example-neighbors-plot-approximate-nearest-neighbors-scalability-py>
with algorithm='ball_tree'. I also had to set metric='euclidean' (this may 
affect the results). The output is:

Index size: 1000, exact: 0.000s, LSHF: 0.007s, speedup: 0.0, accuracy: 1.00 +/- 0.00
Index size: 2511, exact: 0.001s, LSHF: 0.007s, speedup: 0.1, accuracy: 0.94 +/- 0.05
Index size: 6309, exact: 0.001s, LSHF: 0.008s, speedup: 0.2, accuracy: 0.92 +/- 0.07
Index size: 15848, exact: 0.002s, LSHF: 0.008s, speedup: 0.3, accuracy: 0.92 +/- 0.07
Index size: 39810, exact: 0.005s, LSHF: 0.010s, speedup: 0.5, accuracy: 0.84 +/- 0.10
Index size: 100000, exact: 0.008s, LSHF: 0.016s, speedup: 0.5, accuracy: 0.80 +/- 0.06

With n_candidates=100, the output is

Index size: 1000, exact: 0.000s, LSHF: 0.006s, speedup: 0.0, accuracy: 1.00 +/- 0.00
Index size: 2511, exact: 0.001s, LSHF: 0.006s, speedup: 0.1, accuracy: 0.94 +/- 0.05
Index size: 6309, exact: 0.001s, LSHF: 0.005s, speedup: 0.2, accuracy: 0.92 +/- 0.07
Index size: 15848, exact: 0.002s, LSHF: 0.007s, speedup: 0.4, accuracy: 0.90 +/- 0.11
Index size: 39810, exact: 0.005s, LSHF: 0.008s, speedup: 0.7, accuracy: 0.82 +/- 0.13
Index size: 100000, exact: 0.007s, LSHF: 0.013s, speedup: 0.6, accuracy: 0.78 +/- 0.04
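For reference, the exact side of the comparison was essentially just the standard estimator; a minimal sketch of that setup (the data here is random and the sizes are placeholders, not the example's dataset):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.standard_normal((10000, 64))       # indexed points
queries = rng.standard_normal((10, 64))    # query points

# exact ball tree baseline with an explicit euclidean metric,
# matching the settings used in the runs above
nn = NearestNeighbors(n_neighbors=10, algorithm='ball_tree',
                      metric='euclidean')
nn.fit(X)
distances, indices = nn.kneighbors(queries)
```

At these index sizes the tree query is cheap enough that LSHForest's per-query hashing and candidate-scan overhead dominates, which would explain the "speedups" below 1.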



---
Miroslav Batchkarov
PhD Student,
Text Analysis Group,
Department of Informatics,
University of Sussex




------------------------------

Message: 5
Date: Wed, 15 Apr 2015 11:23:32 -0400
From: Andreas Mueller <[email protected]>
Subject: Re: [Scikit-learn-general] Contributing to scikit-learn with
        a re-implementation of a Random Forest based iterative feature
        selection method
To: [email protected]
Message-ID: <[email protected]>
Content-Type: text/plain; charset="windows-1252"

Hi Daniel.
That sounds potentially interesting.
Is there a widely cited paper for this?
I haven't read the paper yet, but it looks very similar to 
RFE(RandomForestClassifier()).
Is it qualitatively different from that? Does it use a different feature 
importance?

btw: your mail was flagged as spam because your link is broken and points to 
an Imperial College internal page.
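As I understand the Boruta idea, it differs from RFE by testing each real feature's importance against permuted "shadow" copies of the features rather than just ranking and eliminating. A minimal single-iteration sketch of that idea (not the package's exact procedure, which repeats this test and tracks hit counts):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def boruta_step(X, y, random_state=0):
    """One Boruta-style iteration: keep features whose RF importance
    beats the best 'shadow' (independently permuted) feature."""
    rng = np.random.default_rng(random_state)
    # permute each column independently, destroying any feature/target link
    shadows = rng.permuted(X, axis=0)
    X_aug = np.hstack([X, shadows])
    rf = RandomForestClassifier(n_estimators=200, random_state=random_state)
    rf.fit(X_aug, y)
    imp = rf.feature_importances_
    n_real = X.shape[1]
    threshold = imp[n_real:].max()      # best importance among shadows
    return imp[:n_real] > threshold     # boolean mask over real features
```

So the selection criterion is a per-feature significance test against a null built from the data itself, which is qualitatively different from RFE's recursive ranking, even though both can use the same RF importances underneath.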

Cheers,
Andy

On 04/15/2015 05:03 AM, Daniel Homola wrote:
> Hi all,
>
> I needed a multivariate feature selection method for my work. As I'm 
> working with biological/medical data, where n < p or even n << p, I 
> started to read up on Random Forest based methods, as in my limited 
> understanding RF copes pretty well with this suboptimal situation.
>
> I came across an R package called Boruta: https://m2.icm.edu.pl/boruta/
>
> After reading the paper and checking some of the pretty impressive 
> citations I thought I'd try it, but it was really slow. So I thought 
> I'd reimplement it in Python, because I hoped (based on this: 
> http://www.slideshare.net/glouppe/accelerating-random-forests-in-scikitlearn)
> that it would be faster. And it is :) I mean a LOT faster.
>
> I was wondering if this would be something that you would consider 
> incorporating into the feature selection module of scikit-learn?
>
> If yes, do you have a tutorial or some sort of guidance about how 
> should I prepare the code, what conventions should I follow, etc?
>
> Cheers,
>
> Daniel Homola
>
> STRATiGRAD PhD Programme
> Imperial College London
>
>


------------------------------


------------------------------



End of Scikit-learn-general Digest, Vol 63, Issue 28
****************************************************
