Re: [Scikit-learn-general] Not consistent return types of fit_transform methods in Vectorizer classes

Lars Buitinck Fri, 04 Jan 2013 06:14:54 -0800

2013/1/4 Olivier Grisel <[email protected]>:
> I don't think it is an oversight. In one case it was easier to
> generate a data structure with a CSC layout, and a COO one in the other.


I think you mean CSR here?

> One does not want to trigger a memory copy by calling `.tocsr` in
> advance if the next estimator in the pipeline needs a CSC layout.
>
> The CSC representation is more efficient for coordinate descent based
> algorithms (right now we just have linear regression models) or
> (ensembles of) decision trees (sparse input is currently not
> implemented, but it might be in the future, and at that point CSC will
> be the best-suited memory layout).

But COO->CSC makes a copy as well, right? So we could just as well
build a CSR matrix directly to avoid a copy in the extremely common
CountVectorizer->TfidfTransformer and
CountVectorizer->atleast2d_or_csr pipelines. CSR->CSC shouldn't be
more expensive than COO->CSC. We build CSR matrices in other places:
SVMlight loader, hashing trick.
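
To make the copy argument concrete, here is a minimal sketch using only
scipy.sparse. The toy counts and all variable names are made up for
illustration; this is not the CountVectorizer code, just the conversions
discussed above.

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical toy term counts: 3 documents, 4 vocabulary terms.
rows = np.array([0, 0, 1, 2, 2])
cols = np.array([0, 2, 1, 0, 3])
vals = np.array([1, 2, 1, 3, 1])

# COO: what you get when you just collect (row, col, value) triples.
coo = sp.coo_matrix((vals, (rows, cols)), shape=(3, 4))

# CSR built directly: since the triples are already grouped by document,
# indptr only has to record where each row starts.
indptr = np.array([0, 2, 3, 5])
csr = sp.csr_matrix((vals, cols, indptr), shape=(3, 4))

# Either starting point can be converted to CSC for a downstream
# estimator that wants column access.
csc_from_coo = coo.tocsc()
csc_from_csr = csr.tocsc()

# Both conversions yield the same matrix ...
same_result = np.array_equal(csc_from_coo.toarray(), csc_from_csr.toarray())

# ... and both allocate a fresh data array, i.e. both make a copy.
copied_from_coo = not np.shares_memory(csc_from_coo.data, coo.data)
copied_from_csr = not np.shares_memory(csc_from_csr.data, csr.data)

# A consumer that asks for CSR gets the very same object back.
no_copy_needed = csr.tocsr() is csr
```

So starting from CSR costs nothing extra on the CSC path, and saves the
conversion whenever the next step wants CSR.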

-- 
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
