Re: [Scikit-learn-general] GSOC 2013 proposal: biclustering

Kemal Eren Mon, 29 Apr 2013 00:21:33 -0700

Hi all,

Thanks for your comments. I have made the suggested revisions to my
proposal. A few comments and questions:

With nsNMF out, the schedule now has some free time. Are there any other
algorithms that you would be interested in?

Dhillon's spectral co-clustering algorithm from 2001 (888 citations) uses a
model very similar to that of the Kluger paper from 2003, which applied the
same concepts to microarray data. I originally cited only the Kluger paper
because it is better known in my field. I have now added a link to Dhillon's
paper, too.

Going by pure citations, I'd say the original Cheng and Church algorithm is
the most popular (1417 citations). In my experience, no one uses it
directly anymore, but it is included as a benchmark in almost every paper.

Since missing value imputation is a better fit elsewhere, I have removed
the data preprocessing work from this proposal. Unless I think of some
really useful preprocessing methods, I think the project will benefit from
being more focused.
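
On the fit_predict / predict point quoted below: here is a minimal sketch of
what reading results from fitted attributes could look like, using
scikit-learn notation (n_samples, n_features). The class name ToyBicluster,
its thresholding rule, and the attribute names rows_ and columns_ are
assumptions made purely for illustration. They are not part of the proposal
and do not implement any of the algorithms cited in this thread.

```python
class ToyBicluster:
    """Hypothetical estimator: one bicluster found by thresholding means.

    Illustration only; this is not Cheng and Church, BiMax, or spectral
    co-clustering.
    """

    def fit(self, X):
        n_samples = len(X)
        n_features = len(X[0])
        global_mean = sum(sum(row) for row in X) / (n_samples * n_features)
        row_means = [sum(row) / n_features for row in X]
        col_means = [
            sum(X[i][j] for i in range(n_samples)) / n_samples
            for j in range(n_features)
        ]
        # A bicluster is a set of rows AND a set of columns, so the result
        # is two indicator arrays rather than one label per sample:
        #   rows_    has shape (n_biclusters, n_samples)
        #   columns_ has shape (n_biclusters, n_features)
        self.rows_ = [[m > global_mean for m in row_means]]
        self.columns_ = [[m > global_mean for m in col_means]]
        return self


X = [
    [9, 8, 1, 0],
    [8, 9, 0, 1],
    [1, 0, 1, 0],
]

model = ToyBicluster().fit(X)

# Read the first (and only) bicluster directly from the fitted attributes.
row_idx = [i for i, flag in enumerate(model.rows_[0]) if flag]
col_idx = [j for j, flag in enumerate(model.columns_[0]) if flag]
submatrix = [[X[i][j] for j in col_idx] for i in row_idx]
```

Because each bicluster needs both a row indicator and a column indicator, a
single predict-style array of length n_samples cannot carry the result, which
is why the sketch exposes attributes instead.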

Best regards,
Kemal

On Mon, Apr 29, 2013 at 6:55 AM, Mathieu Blondel <math...@mblondel.org> wrote:

> Hi Kemal,
>
> Thanks a lot for the modifications. The introduction is now much better
> and the figure is really helpful to visualize what biclustering can do!
>
> Some further comments...
>
> To keep the "proposal timeline" section more concise and focused on your
> schedule during the summer, I would move the descriptions of data
> preprocessing, data generation and evaluation metrics to the previous
> section (you can introduce subsections). While doing that, can you also
> describe in more detail what kind of data generation tool you want to add?
>
> Regarding nsNMF, following the previous discussion, I feel that it may not
> be a good fit for this GSOC: you cannot reuse / depend on GPL code and
> implementing an NMF method will be time-consuming.
>
> I'm a bit concerned with adding to scikit-learn a 2012 paper with only 1
> citation. For this reason, I think I would prefer if you implemented the
> BiMax paper, which has 427 citations.
>
> Regarding fit_predict / predict, the output shape is not compatible with
> the rest of scikit-learn. Therefore, I think we should just expect users to
> directly access the fitted attributes. Can you give an actual code snippet
> and use the same notation as in scikit-learn (e.g., n -> n_samples)?
>
> Regarding missing value imputation, I think that it would be a more
> natural fit in the matrix completion project.
>
> Mathieu
