2011/12/4 David Warde-Farley <[email protected]>:
> On Sun, Dec 04, 2011 at 09:16:56PM +0800, Denis Kochedykov wrote:
>>
>> Hi David,
>>
>> Thanks, very good points. That is
>>
>> 1. C++ rather than Python (in fact this, looks like a plus for me -
>> performance, universality, etc)
>
> I agree from the perspective of universality, but beware of the trap of
> making speed generalizations about languages. A lot of the speed-critical
> parts of sklearn are quite heavily optimized in Cython. I recall that their
> coordinate descent (for generalized linear models) implementation compares
> quite favourably against a widely used and cleverly written Fortran
> implementation.
It depends on the data. The version in sklearn lacks a number of important
optimizations found in glmnet (R frontend with a Fortran backend) that can be
critical for some n_informative / n_features and n_features / n_samples
ratios (I don't remember exactly how). Correlations between informative
features might also have an impact on the convergence speed.

> Sounds like Brian has found the decision tree implementation
> to be quite speedy as well.

The same remark applies here: the regression random forest is still
significantly slower in sklearn than in R's GBM. See ongoing work here:

https://github.com/scikit-learn/scikit-learn/pull/448

> Suffice it to say, it's possible to write quite fast Python code (and in my
> experience, almost always possible to achieve C-like speeds with a dash of
> Cython), and it's also possible to really drop the ball and write very slow
> C/C++ code.

Indeed, speed cannot be inferred from the implementation language: the
algorithm, default parameters and implementation are much more important.
All three vary from one module to another in sklearn and in other libraries.

If you want hard numbers on a specific task, I suggest you play with
http://scikit-learn.github.com/ml-benchmarks/ and add your own dataset and
library to it if they are not already represented.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
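To make the point about data-dependent speed concrete, here is a minimal
sketch of the kind of timing experiment one could run before trusting any
speed claim. It is not part of ml-benchmarks: the helper name time_lasso,
the synthetic data from make_regression, the alpha value and the three
problem shapes are all assumptions chosen for illustration. It only times
sklearn's coordinate-descent Lasso; comparing against glmnet would need the
same data run through R separately.

```python
import time

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso


def time_lasso(n_samples, n_features, n_informative, alpha=0.1, seed=0):
    """Fit a Lasso on synthetic data.

    Returns (fit time in seconds, number of non-zero coefficients).
    time_lasso is a hypothetical helper, not a scikit-learn function.
    """
    X, y = make_regression(
        n_samples=n_samples,
        n_features=n_features,
        n_informative=n_informative,
        noise=1.0,
        random_state=seed,
    )
    model = Lasso(alpha=alpha, max_iter=10000)
    start = time.perf_counter()
    model.fit(X, y)
    elapsed = time.perf_counter() - start
    return elapsed, int(np.count_nonzero(model.coef_))


if __name__ == "__main__":
    # Illustrative shapes only: vary n_features / n_samples and
    # n_informative / n_features, the ratios mentioned above.
    shapes = [(200, 50, 5), (200, 500, 5), (200, 500, 250)]
    for n_samples, n_features, n_informative in shapes:
        elapsed, nnz = time_lasso(n_samples, n_features, n_informative)
        print(
            f"n_samples={n_samples} n_features={n_features} "
            f"n_informative={n_informative}: {elapsed:.4f}s, {nnz} non-zero"
        )
```

Timings from a single run like this are noisy; repeating each fit several
times and keeping the best time gives more stable numbers.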
