Hello scikit-learn team,
I currently work as a developer for the ilastik project (http://ilastik.org/),
and I will be starting a PhD in bioinformatics at UCSD this fall. I would
like to participate in the Google Summer of Code this year.
I have hacked on scikit-learn for my own work in the past. Here are a
few branches I've worked on, mainly to solve a specific problem or to test
an experimental feature for random forests:
- empirical-forest <https://github.com/kemaleren/scikit-learn/tree/empirical-forest>:
multiple response regression with random forests. I planned to clean this
up and submit a pull request, but someone else beat me to it.
- collapsed_rf <https://github.com/kemaleren/scikit-learn/tree/collapsed_rf>:
postprocess multiple regression random forests to sum their responses. This
was for a project where we needed to train on vector responses but only
predict their sum.
- fast_rfr <https://github.com/kemaleren/scikit-learn/tree/fast_rfr>: a
random forest optimized for a regression problem in which many leaves
returned zero arrays.
- sse <https://github.com/kemaleren/scikit-learn/tree/sse>:
experimenting with a different split criterion for regression trees.
- ultrarf <https://github.com/kemaleren/scikit-learn/tree/ultrarf>: an
attempt to speed up training by using ultrametric distance, which can be
precomputed in linear time and queried in constant time.
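To illustrate the collapsed_rf use case above (train on vector responses, predict only their sum), here is a minimal sketch. It is not the code in the branch: it assumes a recent scikit-learn release, uses synthetic data, and simply sums the multi-output predictions after the fact instead of post-processing the trees. Because a forest prediction is an average of leaf values, summing the averaged vector gives the same result as averaging pre-summed leaves.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with a 3-dimensional response (illustrative only).
X, Y = make_regression(n_samples=100, n_features=8, n_targets=3, random_state=0)

# Train on the full vector response.
forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, Y)

# Predict the vector response, then collapse it to a single sum per sample.
vector_predictions = forest.predict(X)               # shape (100, 3)
summed_predictions = vector_predictions.sum(axis=1)  # shape (100,)
```

Collapsing the leaves once after training, as the branch does, would avoid computing the full vector at every prediction.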
Since I will be free this summer, I would like to finally contribute back
to scikit-learn. I am open to project suggestions. For instance, since I
have worked with random forests before, it might make sense to work on
supporting sparse input (scipy.sparse matrices) for random forests.
However, I also have a project that I started last year: implementing
stacked generalization. It would be great to be funded to finish it
this summer.
It's still pretty rough, but you can see what I have done so far in this branch:
stacking <https://github.com/kemaleren/scikit-learn/tree/stacking>. As you
can see, there is still a lot to be done, including adding other stacking
methods such as Feature-Weighted Linear Stacking, supporting various voting
schemes, etc.
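For readers unfamiliar with stacked generalization, here is a minimal sketch of the basic idea. It is not the code in the stacking branch: the helper names fit_stack and predict_stack are hypothetical, the data is synthetic, and it assumes a recent scikit-learn release. The level-0 estimators produce out-of-fold predictions, and a level-1 estimator is trained on those predictions.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.tree import DecisionTreeRegressor


def fit_stack(base_estimators, meta_estimator, X, y, n_splits=5):
    """Fit level-0 estimators and a level-1 combiner (hypothetical helper)."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    # Out-of-fold predictions: each sample is predicted by a model
    # that did not see it during training.
    meta_features = np.column_stack(
        [cross_val_predict(clone(est), X, y, cv=cv) for est in base_estimators]
    )
    fitted_meta = clone(meta_estimator).fit(meta_features, y)
    # Refit each base estimator on all the data for use at prediction time.
    fitted_base = [clone(est).fit(X, y) for est in base_estimators]
    return fitted_base, fitted_meta


def predict_stack(fitted_base, fitted_meta, X):
    """Combine base predictions with the fitted level-1 estimator."""
    meta_features = np.column_stack([est.predict(X) for est in fitted_base])
    return fitted_meta.predict(meta_features)


# Synthetic regression problem (illustrative only).
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
base = [Ridge(alpha=1.0), DecisionTreeRegressor(max_depth=4, random_state=0)]
fitted_base, fitted_meta = fit_stack(base, LinearRegression(), X, y)
predictions = predict_stack(fitted_base, fitted_meta, X)
```

Using out-of-fold predictions keeps the level-1 estimator from simply rewarding whichever base estimator overfits the training set the most.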
This could be a very useful addition to the scikit-learn toolbox. Is there
anyone interested in mentoring this project?
Best regards,
Kemal Eren
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general