Re: [Scikit-learn-general] motivation for the lib, why re-implement existing stuff

Olivier Grisel Sun, 04 Dec 2011 08:29:21 -0800

2011/12/4 David Warde-Farley <[email protected]>:
> On Sun, Dec 04, 2011 at 09:16:56PM +0800, Denis Kochedykov wrote:
>>
>> Hi David,
>>
>> Thanks, very good points. That is
>>
>> 1. C++ rather than Python (in fact this, looks like a plus for me -
>> performance, universality, etc)
>
> I agree from the perspective of universality, but beware of the trap of
> making speed generalizations about languages. A lot of the speed-critical
> parts of sklearn are quite heavily optimized in Cython. I recall that their
> coordinate descent (for generalized linear models) implementation compares
> quite favourably against a widely used and cleverly written Fortran
> implementation.


It depends on the data. The version in sklearn lacks a number of
important optimizations found in glmnet (an R frontend with a Fortran
backend). These can be critical for some n_informative / n_features
and n_features / n_samples ratios, although I don't remember exactly
how. Correlations between informative features might also affect the
convergence speed.

> Sounds like Brian has found the decision tree implementation
> to be quite speedy as well.

Same remark applies here: the regression random forest is still
significantly slower in sklearn than in R's GBM. See ongoing work
here:

  https://github.com/scikit-learn/scikit-learn/pull/448

> Suffice it to say, it's possible to write quite fast Python code (and in my
> experience, almost always possible to achieve C-like speeds with a dash of
> Cython), and it's also possible to really drop the ball and write very slow
> C/C++ code.

Indeed, speed cannot be inferred from the implementation language: the
algorithm, default parameters and implementation are much more
important. All three vary from one module to another in sklearn and
in other libraries.

If you want hard numbers on a specific task, I would suggest you
play with http://scikit-learn.github.com/ml-benchmarks/ and add your
own dataset and library to it if they are not already represented.
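
To illustrate the point about measuring rather than guessing, here is
a minimal timing sketch in plain Python. It is not taken from
ml-benchmarks; the function names, the synthetic data and the toy
task (duplicate detection) are all assumptions chosen for
illustration. It times two algorithms written in the same language on
the same inputs, so any gap comes from the algorithm alone.

```python
import random
import timeit


def has_duplicates_quadratic(values):
    """Compare every pair of values: O(n^2) comparisons."""
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if values[i] == values[j]:
                return True
    return False


def has_duplicates_set(values):
    """Hash every value once: O(n) expected time."""
    return len(set(values)) != len(values)


def make_data(n, seed=0):
    """Hypothetical dataset: n distinct integers in random order.

    With no duplicates present, the quadratic version cannot exit early.
    """
    rng = random.Random(seed)
    return rng.sample(range(10 * n), n)


def benchmark(candidates, sizes, repeat=3):
    """Return {name: {size: best wall-clock seconds}} per candidate."""
    results = {}
    for name, func in candidates.items():
        results[name] = {}
        for n in sizes:
            data = make_data(n)
            # Keep the best of several runs to reduce timing noise.
            timings = timeit.repeat(lambda: func(data), number=1, repeat=repeat)
            results[name][n] = min(timings)
    return results


if __name__ == "__main__":
    candidates = {
        "quadratic": has_duplicates_quadratic,
        "set": has_duplicates_set,
    }
    for name, per_size in benchmark(candidates, sizes=(500, 1000, 2000)).items():
        for n, seconds in per_size.items():
            print(f"{name:>10} n={n:<5} {seconds:.6f} s")
```

The same structure should carry over to comparing libraries: replace
the two functions with calls into each library on identical data, and
vary the dataset shape, since the ranking can change with it.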

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

