2013/1/4 Lars Buitinck <[email protected]>:
> 2013/1/4 Olivier Grisel <[email protected]>:
>> I don't think it is an oversight. In one case it was easier to
>> generate a CSC layouted datastructure and a COO in the other.
>
> I think you mean CSR here?
>
>> One does not want to trigger a memory copy by calling `.tocsr` in
>> advance if the next estimator in the pipeline needs a CSC layout.
>>
>> CSC representation is more efficient for coordinate descent based
>> algorithms (right now we just have linear regression models) or
>> (ensembles of) decision trees (currently the sparse input is not
>> implemented but it might in the future and at that point CSC will be
>> the most adapted memory layout).
>
> But COO->CSC makes a copy as well, right? So we could just as well
> build a CSR matrix directly to avoid a copy in the extremely common
> CountVectorizer->TfidfTransformer and
> CountVectorizer->atleast2d_or_csr pipelines. CSR->CSC shouldn't be
> more expensive than COO->CSC. We build CSR matrices in other places:
> SVMlight loader, hashing trick.

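To make "build a CSR matrix directly" concrete, here is a minimal sketch of filling the three CSR arrays row by row, with no intermediate COO triplets. It is only an illustration under stated assumptions: `build_csr_arrays`, the token-list input and the vocabulary grown on the fly are hypothetical names and choices, not the CountVectorizer or FeatureHasher code.

```python
from collections import Counter


def build_csr_arrays(docs):
    """Build the three CSR arrays (data, indices, indptr) row by row.

    `docs` is an iterable of token lists; each document becomes one row.
    The function name and the on-the-fly vocabulary are illustrative
    assumptions, not scikit-learn code.
    """
    vocabulary = {}   # token -> column index, grown as tokens are seen
    data = []         # non-zero values, row after row
    indices = []      # column index of each value in `data`
    indptr = [0]      # row i occupies data[indptr[i]:indptr[i + 1]]

    for tokens in docs:
        counts = Counter()
        for token in tokens:
            col = vocabulary.setdefault(token, len(vocabulary))
            counts[col] += 1
        # Emit the row with sorted column indices and no duplicates.
        for col in sorted(counts):
            indices.append(col)
            data.append(counts[col])
        # Close the row, even when it is empty.
        indptr.append(len(indices))

    return data, indices, indptr, vocabulary


docs = [["a", "b", "a"], [], ["c", "a"]]
data, indices, indptr, vocabulary = build_csr_arrays(docs)
shape = (len(indptr) - 1, len(vocabulary))
```

With SciPy available, these arrays can be handed to `scipy.sparse.csr_matrix((data, indices, indptr), shape=shape)`, which should skip the COO step entirely. A later `.tocsc()` would still allocate a new matrix for estimators that prefer the CSC layout.
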
+1 for materializing the initial sparse data structure directly as a CSR
matrix instead of a COO one. But that is less trivial than just calling
`.tocsr()`. If someone wants to volunteer, he/she should have a look at
the implementation of the FeatureHasher class as an example of how to
generate a CSR data structure.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
