Re: [Scikit-learn-general] Progress and mid-term evaluation: Speedup of coordinate descent for linear models

Olivier Grisel Tue, 10 Jul 2012 02:28:21 -0700

2012/7/10 iBayer <[email protected]>:
> Dear all,
>
> since the start of the project I've been in continuous exchange with my
> mentor (Alexandre Gramfort) via several pull-request comments. There, I've
> been reporting my status and asking for feedback when needed. Alexandre's
> prompt feedback kept me going and assured me that I was on the right track,
> despite several unforeseen obstacles I had to overcome :).
>
>
> In detail, I've been working on the following tasks (listed in order of
> processing)
>
> Merging of dense and sparse classes of ElasticNet and Lasso:
>
> Done PR: #891 (https://github.com/scikit-learn/scikit-learn/pull/891)
>
> Setting up benchmark code and datasets:
>
> Hacked together some scripts to call glmnet via RPy2.
> https://gist.github.com/3078871
>
> Set up Vlad's vbench fork.
>
> Tried to find a way to make the benchmark datasets easily accessible by
> uploading them to mldata.org.
> This was mainly motivated by the existing code in scikit-learn to download
> data from mldata.org. https://gist.github.com/3078964
>
> Postponed work on this part after losing too much time due to mldata.org's
> outdated hdf5 and generally very limited documentation.
>
> Developing a replacement for the enet coordinate descent algorithm by
> incorporating tricks from glmnet:
>
> Progress documentation PR #911
> (https://github.com/scikit-learn/scikit-learn/pull/911)
>
> Prototype in Python
>
> Covariance updates
>
> Active set of features
>
> Integration of Python prototype in cd_fast.pyx.
>
> Speedups in Cython:
>
> Type def
>
> Using cblas functions
>
> Currently profiling. I suspect the active set implementation is the
> bottleneck. The ongoing investigation is documented in PR #911, please feel
> free to comment.
>
>
> Caching in covariance updates turned out to be far more complex than I
> expected when memory consumption has to be kept within reasonable bounds.
>
> Tests for the new regression implementation are in place. They check the new
> code on simple examples and against the current implementation. I'm
> therefore confident that my implementation is working correctly.
>
> I haven't started with the implementation of the logistic regression models,
> since the regression case is expected to give useful experience for the
> implementation of the logistic regression models. Unfortunately, the
> implementation of the regression case proved to be more complex than
> expected and is not done yet.
>
> The glmnet implementation is not yet competitive with the current
> implementation. I plan to reach that step by the midterm evaluation.


What do you mean exactly by glmnet? To me glmnet is the name of the R
project implementing various l1 or l1+l2 penalized linear models using
coordinate descent + various implementation / algorithmic tricks such
as maintaining an active set of features with nonzero weights.

The current content of cd_fast.pyx is a partial implementation of
those (only squared error regression loss without active set).
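
To make the distinction concrete, here is a minimal NumPy sketch of
coordinate descent for the squared error loss with an l1 + l2 penalty,
plus a simple active set loop on top. It is not the code in cd_fast.pyx
nor glmnet's; the function names (soft_threshold, cd_pass,
enet_cd_active_set) and the parameters alpha, l1_ratio and tol are
hypothetical, and the objective is assumed to be
1/(2n) ||y - Xw||^2 + alpha * l1_ratio * ||w||_1
+ 0.5 * alpha * (1 - l1_ratio) * ||w||^2.

```python
import numpy as np


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (proximal step for the l1 penalty)."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)


def cd_pass(X, w, residual, features, l1_reg, l2_reg, col_norms):
    """One coordinate descent sweep over `features`.

    Updates w and residual in place and returns the largest weight change.
    """
    n_samples = X.shape[0]
    max_change = 0.0
    for j in features:
        if col_norms[j] == 0.0:
            continue  # a constant-zero column never gets a weight
        w_old = w[j]
        # Correlation of feature j with the residual once j's own
        # contribution has been added back.
        rho = (X[:, j] @ residual + col_norms[j] * w_old) / n_samples
        w[j] = soft_threshold(rho, l1_reg) / (col_norms[j] / n_samples + l2_reg)
        if w[j] != w_old:
            residual -= X[:, j] * (w[j] - w_old)
            max_change = max(max_change, abs(w[j] - w_old))
    return max_change


def enet_cd_active_set(X, y, alpha=0.1, l1_ratio=1.0, tol=1e-8, max_iter=1000):
    """Elastic net by coordinate descent with a naive active set strategy."""
    n_samples, n_features = X.shape
    l1_reg = alpha * l1_ratio
    l2_reg = alpha * (1.0 - l1_ratio)
    col_norms = (X ** 2).sum(axis=0)
    w = np.zeros(n_features)
    residual = np.array(y, dtype=float)  # residual = y - X @ w with w = 0
    for _ in range(max_iter):
        # Full sweep: lets features enter or leave the active set.
        if cd_pass(X, w, residual, range(n_features),
                   l1_reg, l2_reg, col_norms) < tol:
            break
        # Cheaper sweeps restricted to features with nonzero weights.
        active = np.flatnonzero(w)
        for _ in range(max_iter):
            if cd_pass(X, w, residual, active,
                       l1_reg, l2_reg, col_norms) < tol:
                break
    return w
```

The inner loop only touches the columns whose weights are currently
nonzero; the outer full sweep is what checks that no excluded feature
should come back in. Covariance updates are left out of this sketch.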

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

