Dear all,
I have updated the proposal
https://docs.google.com/document/d/1nnrAsEfkXpGRlc_PMEeuUUQ1ZcNMfy-7M-dZcIVp2lU/edit?usp=sharing
following your advice.
I have reduced the number of proposed algorithms and tried to explain
more clearly why we need them and how we can implement them.

The deadline is tomorrow, but I am happy to accept last-minute changes.

Thanks a lot,
Luca

On Wed, Mar 25, 2015 at 8:55 PM, Michael Eickenberg <
michael.eickenb...@gmail.com> wrote:

> Hi Luca,
>
> thanks for your gsoc proposal. The proposed topics look interesting as
> such, but I am having a hard time following the planning: A more
> fine-grained timeline than 3-4 weeks per sub-project would be very helpful.
> As Andy says, code review and revisions take time, which should probably be
> allocated as a multiple of coding time, especially if you are not yet 100%
> familiar with the coding conventions.
>
> Next, it would be helpful if you could motivate more precisely why we need
> extra algorithms for PCA and Sparse-PCA. Make sure to include in this
> analysis what the existing implementations are based on, what the new
> implementations add, and in which regime of matrix size they show their
> full power. Note that sklearn.decomposition.IncrementalPCA is actually
> pretty good at handling very large datasets in both dimensions, and
> sklearn.decomposition.RandomizedPCA is a useful approximation.
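> For illustration, a minimal sketch of the existing out-of-core path
> (synthetic random data standing in for chunks read from disk):
>
> import numpy as np
> from sklearn.decomposition import IncrementalPCA
>
> ipca = IncrementalPCA(n_components=10)
> for _ in range(50):                       # stream 50 chunks of 200 samples
>     X_chunk = np.random.randn(200, 1000)  # stand-in for one chunk from disk
>     ipca.partial_fit(X_chunk)             # updates the components in place
> X_reduced = ipca.transform(np.random.randn(5, 1000))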
>
> As for Multitask Lasso, it is already implemented using a coordinate
> descent algorithm. In my mind this is a very specific algorithm with
> restricted practical applicability - please do correct me if I am wrong. I
> was surprised to see this in the code base before a simple group lasso was
> added. Has it proven useful in practice, outside MEG source reconstruction?
> As for documentation, yes, that can be extended, but here is an example:
> http://scikit-learn.org/stable/auto_examples/linear_model/plot_multi_task_lasso_support.html
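> For concreteness, a minimal sketch mirroring that example (synthetic data;
> the point is that the mixed-norm penalty makes all tasks select the same
> features):
>
> import numpy as np
> from sklearn.linear_model import MultiTaskLasso
>
> rng = np.random.RandomState(42)
> X = rng.randn(100, 30)
> W = np.zeros((5, 30))                      # 5 tasks, 30 features
> W[:, :5] = rng.randn(5, 5)                 # only the first 5 features matter
> Y = X.dot(W.T) + 0.1 * rng.randn(100, 5)
>
> mtl = MultiTaskLasso(alpha=0.5).fit(X, Y)
> print(np.where((mtl.coef_ != 0).any(axis=0))[0])  # shared support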
>
> My tendency would be to leave the aforementioned algorithms out, unless you
> can strongly motivate the benefits of the proposed PCA/Sparse-PCA
> algorithms, and to focus on the other algorithms by elaborating on how they
> would fit into the codebase and how they relate to the algorithms already
> there.
>
> As Andy mentions, you should also provide evaluation metrics that go along
> with the proposed algorithms.
>
> Further:
>
> >> Feature Subset Selection and Ranking for Data Dimensionality Reduction
> >> seems borderline with 120 cites since 2007.
> >>
> >
> > I think that this one should be included. In my opinion it is very
> > useful. I did some research on it, and there are numerous conference
> > papers with different algorithms that lead to the same results.
> > In other words, numerous people from numerous fields are using this
> > algorithm without knowing it.
>
> Please list these in your proposal.
>
>
> Thanks,
> Michael
>
>
> On Wed, Mar 25, 2015 at 3:23 PM, Luca Puggini <lucapug...@gmail.com>
> wrote:
>
>> Dear All,
>> following some of the advice, I have modified my proposal
>> https://docs.google.com/document/d/1gCHUKsfvii1sUQW-4E4dpbpWkmTPAg6WpLUcWbu4vk0/edit?usp=sharing
>>
>> I am now subscribed to the full ML, so I will try to keep the whole
>> conversation in the same thread.
>>
>> Let me know what you think.
>>
>> I am open to any comments, advice, or suggestions.
>>
>> Thanks,
>> Luca
>>
>> -----------------------------------------------------------------------
>>
>> Hi Luca,
>>
>> Instead of creating a new thread every time, it would be nice if you could
>> reply directly in the same thread. This would make the discussion easier
>> to
>> follow.
>>
>> (To do so you need to be fully subscribed to the ML. I'm guessing you may
>> be subscribed to the digest version)
>>
>> Thanks,
>> M.
>>
>> On Wed, Mar 25, 2015 at 9:16 AM, Luca Puggini <lucapug...@gmail.com>
>> wrote:
>>
>> > Hi guys,
>> > thanks for your interest.
>> >
>> > Some comments below
>> >
>> > On Tue, Mar 24, 2015 at 4:32 PM, Andy <t3k...@gmail.com> wrote:
>> >>
>> >> Hi Luca.
>> >> If you give permissions to write comments, I could comment on the Google
>> >> doc in-place, which might be helpful.
>> >
>> >
>> > Now you have edit privileges. Let me know if you have any problems.
>> >
>> >
>> >> As I think was commented earlier, the current PLS already implements
>> >> NIPALS. What would the addition be?
>> >> Use that in PCA? That is not super clear from the proposal.
>> >>
>> >
>> > Yes, I was thinking of PCA. I will state it more clearly.
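>> > For concreteness, here is a rough numpy sketch of the NIPALS iteration
>> > for PCA (just an illustration of the scheme, not the existing PLS code;
>> > it assumes the first column of X is non-zero):
>> >
>> > import numpy as np
>> >
>> > def nipals_pca(X, n_components, tol=1e-7, max_iter=500):
>> >     X = X - X.mean(axis=0)             # work on centered data
>> >     T, P = [], []
>> >     for _ in range(n_components):
>> >         t = X[:, 0].copy()             # initial score vector
>> >         for _ in range(max_iter):
>> >             p = X.T.dot(t) / t.dot(t)  # loadings: regress columns on score
>> >             p /= np.linalg.norm(p)
>> >             t_new = X.dot(p)           # updated score
>> >             if np.linalg.norm(t_new - t) < tol:
>> >                 t = t_new
>> >                 break
>> >             t = t_new
>> >         X = X - np.outer(t, p)         # deflate before the next component
>> >         T.append(t)
>> >         P.append(p)
>> >     return np.array(T).T, np.array(P).T  # scores, loadings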
>> >
>> >
>> >> I think implementing this together with the other paper you mention
>> >> will take more than one or two weeks.
>> >> Please keep in mind that it needs tests, documentation, examples and
>> >> reviews.
>> >>
>> > I do not know how much time we need to write sklearn-quality code. If
>> > you think that we need more time, I trust you. :)
>> >
>> >
>> >> The "massive parallel" paper only has 8 citations since 2013. That
>> seems
>> >> pretty low impact and not very established.
>> >>
>> >
>> > I am happy to remove it from the list.
>> >
>> >
>> >> Unsupervised Feature Selection Using Feature Similarity seems a much
>> >> safer bet (800 cites since 2002), though I am not
>> >> familiar enough with the area to say if it is still comparable to state
>> >> of the art or useful.
>> >>
>> >
>> > Very difficult to say. I may look more into the details of this.
>> >
>> >
>> >> Feature Subset Selection and Ranking for Data Dimensionality Reduction
>> >> seems borderline with 120 cites since 2007.
>> >>
>> >
>> > I think that this one should be included. In my opinion it is very
>> > useful. I did some research on it, and there are numerous conference
>> > papers with different algorithms that lead to the same results.
>> > In other words, numerous people from numerous fields are using this
>> > algorithm without knowing it.
>> >
>> >> I haven't actually had time to check the papers (yet?); this is just a
>> >> first, very superficial review.
>> >>
>> >> Instead of focusing on many algorithms, I think you should also allocate
>> >> some time to ensure that we have good evaluation metrics and
>> >> cross-validation support for multi-output algorithms where Y might be an
>> >> input to transform (not sure for how many of these algorithms this is
>> >> the case).
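>> >> A minimal sketch of the kind of usage that should be supported
>> >> (synthetic data; assuming the current sklearn.cross_validation module
>> >> and that the r2 scorer accepts 2-D Y):
>> >>
>> >> import numpy as np
>> >> from sklearn.cross_validation import cross_val_score
>> >> from sklearn.linear_model import MultiTaskLasso
>> >>
>> >> rng = np.random.RandomState(0)
>> >> X = rng.randn(120, 20)
>> >> Y = X[:, :3].dot(rng.randn(3, 4)) + 0.1 * rng.randn(120, 4)
>> >>
>> >> # 5-fold CV of a multi-output regressor; r2 averages over the outputs
>> >> scores = cross_val_score(MultiTaskLasso(alpha=0.1), X, Y, cv=5,
>> >>                          scoring='r2')
>> >> print(scores.mean())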
>> >>
>> >>
>> > Yes, this is an important point. I am happy to delete some of them. We
>> > can choose what to keep, with the help of the community.
>> >
>> >
>> >
>> >> How is the multi-task lasso that you are proposing different from the
>> >> one implemented already in scikit-learn?
>> >>
>> >>
>> >> http://scikit-learn.org/dev/modules/generated/sklearn.linear_model.MultiTaskLasso.html#sklearn.linear_model.MultiTaskLasso
>> >>
>> >
>> > I was not aware that multi-task lasso was already present in sklearn.
>> > The documentation does not link to any reference. Maybe it is exactly
>> > equivalent to the proposed method, and in that case the paper could be
>> > used as a reference.
>> >
>> >
>> >
>> >> The project sounds great; the hardest part might be finding the right
>> >> mentor (Gael?).
>> >>
>> >>
>> > I am glad to hear that. Let's see if we can find a supervisor.
>> >
>> >
>> >
>> >
>> >
>> >> Cheers,
>> >> Andy
>> >>
>> >>
>> >>
>> > In addition, I think that a general hill-climbing algorithm could be
>> > useful in sklearn. Many algorithms can be defined as a hill-climbing
>> > minimization problem where a custom initial state, neighbour function,
>> > and cost function are provided.
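>> >
>> > A rough sketch of the generic interface I have in mind (all names
>> > hypothetical):
>> >
>> > def hill_climb(initial_state, neighbours, cost, max_iter=1000):
>> >     """Greedy first-improvement descent over a user-defined space."""
>> >     state, best = initial_state, cost(initial_state)
>> >     for _ in range(max_iter):
>> >         improved = False
>> >         for candidate in neighbours(state):  # user-supplied move generator
>> >             c = cost(candidate)
>> >             if c < best:
>> >                 state, best, improved = candidate, c, True
>> >                 break                        # take the first improving move
>> >         if not improved:                     # local minimum reached
>> >             break
>> >     return state, best
>> >
>> > For feature selection, for instance, the state could be a boolean feature
>> > mask, the neighbours could flip one bit at a time, and the cost could be
>> > a cross-validated error.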
>> >
>> >
>> >
>> > Let me know if you have more advice or comments.
>> > Thanks a lot,
>> > Luca
>> >
>>
>>
>>
>>
>
>
>
>
