2013/1/4 Lars Buitinck <[email protected]>:
> 2013/1/4 Olivier Grisel <[email protected]>:
>> I don't think it is an oversight. In one case it was easier to
>> generate a data structure with a CSC layout, and a COO one in the other.
>
> I think you mean CSR here?
>
>> One does not want to trigger a memory copy by calling `.tocsr` in
>> advance if the next estimator in the pipeline needs a CSC layout.
>>
>> The CSC representation is more efficient for coordinate-descent-based
>> algorithms (right now we just have linear regression models) or
>> (ensembles of) decision trees (sparse input is not currently
>> implemented, but it might be in the future, and at that point CSC will
>> be the best-suited memory layout).
>
> But COO->CSC makes a copy as well, right? So we could just as well
> build a CSR matrix directly to avoid a copy in the extremely common
> CountVectorizer->TfidfTransformer and
> CountVectorizer->atleast2d_or_csr pipelines. CSR->CSC shouldn't be
> more expensive than COO->CSC. We build CSR matrices in other places:
> SVMlight loader, hashing trick.

+1 for materializing the initial sparse data structure directly as a
CSR matrix instead of a COO one. But that is less trivial than just
calling ".tocsr()".

If someone wants to volunteer, they should have a look at the
implementation of the FeatureHasher class as an example of how to
generate a CSR data structure directly.
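
To give an idea of what "directly as CSR" could look like, here is a
minimal sketch in pure Python. It is not the FeatureHasher code and not
what CountVectorizer does today; the function name `count_matrix_csr`
and the whitespace tokenization are assumptions made up for this
example. It fills the three CSR arrays (data, indices, indptr) row by
row and hands them to scipy.sparse.csr_matrix, so no COO intermediate
is ever created.

```python
from collections import Counter

import numpy as np
import scipy.sparse as sp


def count_matrix_csr(docs, vocabulary):
    """Build a term-count matrix in CSR format without a COO intermediate.

    `docs` is a list of strings and `vocabulary` maps token -> column index.
    Both the name and the naive whitespace tokenizer are hypothetical.
    """
    indptr = [0]   # indptr[i]:indptr[i + 1] delimits row i in indices/data
    indices = []   # column index of each stored value
    data = []      # the stored counts themselves

    for doc in docs:
        # Aggregate counts per column for this row, skipping unknown tokens.
        counts = Counter(
            vocabulary[tok] for tok in doc.split() if tok in vocabulary
        )
        # Emit columns in sorted order so the result has sorted indices.
        for j in sorted(counts):
            indices.append(j)
            data.append(counts[j])
        indptr.append(len(indices))

    return sp.csr_matrix(
        (
            np.asarray(data, dtype=np.int64),
            np.asarray(indices, dtype=np.int32),
            np.asarray(indptr, dtype=np.int32),
        ),
        shape=(len(docs), len(vocabulary)),
    )
```

Because the rows are written in order, a downstream CSR consumer can use
the result as is, while an estimator that wants CSC can still call
".tocsc()" on it, which is the single conversion Lars describes above.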

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general