2013/1/4 Mathieu Blondel <[email protected]>:
> It would be nice to build the matrix directly in CSR but not essential:
> CountVectorizer can only deal with rather medium-scale datasets anyway so a
> conversion with tocsr() is reasonable (although it does assume that the
> dataset can fit twice in memory). FeatureHasher, on the other hand, targets
> large-scale datasets so dealing with CSR directly (the most common format)
> was a good choice.

The proposed improvements by @ephes (still somewhere on my TODO list)
would improve the scalability of CountVectorizer greatly, though...

> As a big fan of coordinate descent, I would really like if the scikit could
> be more CSC friendly (svmlight loader, preprocessing tools, feature hasher).
> The svmlight format is CSR oriented but using a two-pass algorithm or some
> temporary intermediary data structures, I think it should be possible to
> construct a CSC matrix directly.

Maybe there's a way to convert CSR to CSC without copying the values?
With our favorite dtype=np.float64, these make up the bulk of a typical
CS[CR] matrix in terms of memory use.

-- 
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
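
For illustration, the two-pass idea from the quoted message could look roughly like the sketch below. It is pure Python and not taken from scikit-learn or SciPy: the function name `rows_to_csc` and the input layout (a list of rows, each a list of `(column, value)` pairs, loosely mirroring svmlight lines) are assumptions made for this example. The first pass counts the entries per column to obtain the column pointers; the second pass writes each value and its row index into the slot reserved for its column.

```python
from array import array
from typing import Sequence, Tuple

# Hypothetical input layout: one sequence of (column, value) pairs per row.
Row = Sequence[Tuple[int, float]]


def rows_to_csc(rows: Sequence[Row], n_cols: int):
    """Build CSC arrays (data, indices, indptr) from row-oriented input.

    Illustrative sketch only; `rows` is iterated twice, which for a file
    would mean reading it twice.
    """
    # Pass 1: count the stored entries in every column.
    counts = [0] * n_cols
    for row in rows:
        for col, _ in row:
            counts[col] += 1

    # A running sum of the counts gives the start offset of each column.
    indptr = array("l", [0] * (n_cols + 1))
    for j in range(n_cols):
        indptr[j + 1] = indptr[j] + counts[j]

    nnz = indptr[n_cols]
    data = array("d", [0.0] * nnz)     # float64 values
    indices = array("l", [0] * nnz)    # row index of each stored value

    # Pass 2: scatter every entry into the next free slot of its column.
    next_slot = list(indptr[:n_cols])
    for i, row in enumerate(rows):
        for col, value in row:
            k = next_slot[col]
            data[k] = value
            indices[k] = i
            next_slot[col] += 1

    return data, indices, indptr


# Example: the 3x3 matrix [[1, 0, 2], [0, 3, 0], [4, 0, 5]].
example_rows = [[(0, 1.0), (2, 2.0)], [(1, 3.0)], [(0, 4.0), (2, 5.0)]]
data, indices, indptr = rows_to_csc(example_rows, n_cols=3)
```

Because rows are visited in order, the row indices inside each column come out sorted. Each value is written exactly once into its final position, so no intermediate CSR copy of the values is needed.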
