2013/1/4 Mathieu Blondel <[email protected]>:
> It would be nice to build the matrix directly in CSR but not essential:
> CountVectorizer can only deal with rather medium-scale datasets anyway so a
> conversion with tocsr() is reasonable (although it does assume that the
> dataset can fit twice in memory). FeatureHasher, on the other hand, targets
> large-scale datasets so dealing with CSR directly (the most common format)
> was a good choice.

The changes proposed by @ephes (still somewhere on my TODO list)
would greatly improve CountVectorizer's scalability, though.

> As a big fan of coordinate descent, I would really like if the scikit could
> be more CSC friendly (svmlight loader, preprocessing tools, feature hasher).
> The svmlight format is CSR oriented but using a two-pass algorithm or some
> temporary intermediary data structures, I think it should be possible to
> construct a CSC matrix directly.
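
For what it's worth, a rough sketch of the two-pass idea in plain
Python/NumPy. This is not code from scikit-learn or the svmlight loader;
rows_to_csc and its input format (one list of (column, value) pairs per
sample) are made up for illustration. Pass one counts the entries per
column to get indptr, pass two drops each entry into its column's next
free slot.

```python
import numpy as np
import scipy.sparse as sp


def rows_to_csc(rows, n_features):
    """Build a CSC matrix from row-oriented (column, value) records.

    `rows` is a hypothetical input format: a sequence that can be iterated
    twice, holding one list of (column, value) pairs per sample.
    """
    # Pass 1: count the stored entries in each column.
    counts = np.zeros(n_features, dtype=np.int64)
    n_samples = 0
    for row in rows:
        n_samples += 1
        for j, _ in row:
            counts[j] += 1

    # Column pointers are the running sum of the per-column counts.
    indptr = np.zeros(n_features + 1, dtype=np.int64)
    np.cumsum(counts, out=indptr[1:])

    nnz = int(indptr[-1])
    data = np.empty(nnz, dtype=np.float64)
    indices = np.empty(nnz, dtype=np.int64)

    # Pass 2: write each entry into the next free slot of its column.
    # Rows are visited in order, so row indices end up sorted per column.
    next_slot = indptr[:-1].copy()
    for i, row in enumerate(rows):
        for j, value in row:
            k = next_slot[j]
            indices[k] = i
            data[k] = value
            next_slot[j] += 1

    return sp.csc_matrix((data, indices, indptr), shape=(n_samples, n_features))


rows = [[(0, 1.0), (2, 2.0)], [(1, 3.0)], [(0, 4.0), (2, 5.0)]]
X = rows_to_csc(rows, n_features=3)
```

The values are written exactly once, so no second copy of the data is
needed, at the price of reading the input twice. The Python loops are only
there to show the logic; a real loader would presumably do this in Cython.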

Maybe there's a way to convert CSR to CSC without copying the values?
With our favorite dtype=np.float64, the values array makes up the bulk
of a typical CS[CR] matrix in terms of memory use.
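
A quick way to check both points with scipy.sparse, as a sketch on random
data (the matrix size and density are arbitrary, and csr_memory is just a
helper defined here): it reports the bytes held by each of the three CSR
arrays and tests whether tocsc() reuses the values buffer.

```python
import numpy as np
import scipy.sparse as sp

# Arbitrary example matrix; shape and density are illustrative only.
rng = np.random.RandomState(0)
X = sp.random(1000, 500, density=0.01, format="csr",
              dtype=np.float64, random_state=rng)


def csr_memory(X):
    """Bytes used by each of the three arrays backing a CSR/CSC matrix."""
    return {
        "data": X.data.nbytes,
        "indices": X.indices.nbytes,
        "indptr": X.indptr.nbytes,
    }


mem = csr_memory(X)
data_fraction = mem["data"] / sum(mem.values())
print(mem, data_fraction)

# tocsc() has to reorder the values, so it allocates a new data array.
X_csc = X.tocsc()
print(np.shares_memory(X.data, X_csc.data))

# The transpose is a different story: the CSR arrays of X are, as they
# stand, the CSC arrays of X.T, so scipy can reuse them without a copy.
Xt = X.T
print(Xt.format, np.shares_memory(X.data, Xt.data))
```

So the copy-free case only covers the transpose; for the same matrix in the
other layout the values have to be permuted, which is where the second
buffer comes from.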

-- 
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
