Dear all,

since the start of the project I've been in continuous exchange with my
mentor (Alexandre Gramfort) via several pull-request comments. There, I've
been reporting my status and asking for feedback when needed. Alexandre's
prompt feedback kept me going and assured me that I was on the right track,
despite several unforeseen obstacles I had to master :).


In detail, I've been working on the following tasks (listed in order of
processing):

- Merging of dense and sparse classes of ElasticNet and Lasso:
  - Done, PR #891 (https://github.com/scikit-learn/scikit-learn/pull/891)

- Setting up benchmark code and datasets:
  - Hacked together some scripts to call glmnet via RPy2 (see the first
    code sketch below): https://gist.github.com/3078871
  - Set up Vlad's vbench fork.
  - Tried to find a way to make the benchmark datasets
    (https://github.com/scikit-learn/scikit-learn/wiki/Setting-up-tests-to-benchmark-current-and-future-code)
    easily accessible by uploading them to mldata.org. This was mainly
    motivated by the existing code in scikit-learn to download data from
    mldata.org: https://gist.github.com/3078964
  - Postponed work on this part after losing too much time due to
    mldata.org's outdated HDF5 and its generally very limited
    documentation.

- Developing a replacement for the enet coordinate descent algorithm by
  incorporating tricks from glmnet (see the second code sketch below):
  - Progress is documented in PR #911
    (https://github.com/scikit-learn/scikit-learn/pull/911)
  - Prototype in Python:
    - Covariance updates
    - Active set of features
  - Integration of the Python prototype into cd_fast.pyx.
  - Speedups in Cython:
    - Type declarations
    - Using cblas functions
    - Currently profiling. I suspect the active set implementation to be
      the bottleneck. The ongoing investigation is documented in PR #911,
      please feel free to comment.

- Caching in covariance updates turned out to be far more complex than I
  expected when keeping the memory consumption within reasonable bounds.

- Tests for the new regression implementation are in place. I'm checking
  the new code on simple examples and against the current implementation,
  so I'm confident that my implementation is working correctly.

  I haven't started with the implementation of the logistic regression
  models, since the regression case is expected to provide useful
  experience for the implementation of the logistic regression models.
  Unfortunately, the implementation of the regression case proved to be
  more complex than expected and is not done yet.

- The new glmnet-based implementation is not yet competitive with the
  current implementation. I plan to reach that point by the midterm
  evaluation.
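
First code sketch: a minimal example of calling glmnet through RPy2, in
the spirit of the benchmark scripts linked above. This is only a sketch,
not the gist itself; it assumes the R glmnet package is installed and
uses rpy2's numpy conversion layer (module names may vary between rpy2
versions).

    import time
    import numpy as np
    from rpy2.robjects.packages import importr
    from rpy2.robjects import numpy2ri

    numpy2ri.activate()          # transparent numpy <-> R conversion
    glmnet = importr('glmnet')   # the R reference implementation

    # toy regression problem
    rng = np.random.RandomState(0)
    X = rng.randn(200, 50)
    y = X[:, :5].sum(axis=1) + 0.1 * rng.randn(200)

    # alpha=1.0 selects the pure lasso penalty; glmnet computes its own
    # regularization path
    t0 = time.time()
    fit = glmnet.glmnet(X, y, alpha=1.0)
    print('glmnet fit time: %.4f s' % (time.time() - t0))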

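Second code sketch: the covariance-update trick from glmnet as I
prototyped it in pure Python (the active set bookkeeping is left out
here). Again just a rough sketch for illustration, not the code in
PR #911; the sanity check at the end uses the current scikit-learn API.

    import numpy as np
    from sklearn.linear_model import ElasticNet

    def soft_threshold(z, t):
        # S(z, t) = sign(z) * max(|z| - t, 0)
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def enet_cd_covariance(X, y, alpha=1.0, l1_ratio=0.5,
                           max_iter=1000, tol=1e-8):
        # Minimizes 1/(2n) ||y - Xw||^2 + alpha*l1_ratio*||w||_1
        #           + 0.5*alpha*(1 - l1_ratio)*||w||^2
        # by coordinate descent. The 'covariance updates' consist of
        # precomputing gram = (1/n) X^T X and Xy = (1/n) X^T y so that
        # the residual never has to be recomputed explicitly.
        n_samples, n_features = X.shape
        gram = np.dot(X.T, X) / n_samples
        Xy = np.dot(X.T, y) / n_samples
        l1_reg = alpha * l1_ratio
        l2_reg = alpha * (1.0 - l1_ratio)
        w = np.zeros(n_features)
        for _ in range(max_iter):
            max_change = 0.0
            for j in range(n_features):
                w_j_old = w[j]
                # correlation of feature j with the partial residual,
                # recovered from the cached covariance terms
                rho = Xy[j] - np.dot(gram[j], w) + gram[j, j] * w_j_old
                w[j] = soft_threshold(rho, l1_reg) / (gram[j, j] + l2_reg)
                max_change = max(max_change, abs(w[j] - w_j_old))
            if max_change < tol:
                break
        return w

    # sanity check against the current scikit-learn implementation
    rng = np.random.RandomState(0)
    X = rng.randn(100, 20)
    y = X[:, :3].sum(axis=1) + 0.01 * rng.randn(100)
    w = enet_cd_covariance(X, y, alpha=0.1, l1_ratio=0.7)
    ref = ElasticNet(alpha=0.1, l1_ratio=0.7, fit_intercept=False).fit(X, y)
    print(np.abs(w - ref.coef_).max())   # difference should be small
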
Thanks to all involved for constantly giving constructive feedback,
especially Alexandre for his amazing real-time communication and Olivier
Grisel for his valuable advice concerning Cython. I have learned a lot and
had fun doing so, even though some tasks turned out to be more demanding
than expected :).

Best,

Immanuel