On Fri, Jan 4, 2013 at 11:32 PM, Olivier Grisel <[email protected]> wrote:

>
> +1 for materializing the initial sparse datastructure directly as a
> CSR instead of a COO. But that is less trivial than just calling
> ".tocsr()".
>
> If someone wants to volunteer, he/she should have a look at the
> implementation of the FeatureHasher class as an example of a CSR
> datastructure generation.
>

It would be nice to build the matrix directly in CSR, but it is not
essential: CountVectorizer can only deal with medium-scale datasets anyway,
so a conversion with tocsr() is reasonable (although it does assume that the
dataset can fit twice in memory). FeatureHasher, on the other hand, targets
large-scale datasets, so building CSR (the most common format) directly
was a good choice.
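
To make the trade-off concrete, here is a minimal sketch of the two routes
with scipy.sparse. It is not the actual CountVectorizer or FeatureHasher
code; the toy documents and the fixed vocabulary are made up for
illustration.

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical toy input: tokenized documents and a fixed vocabulary.
docs = [["a", "b", "a"], ["b", "c"], []]
vocab = {"a": 0, "b": 1, "c": 2}
shape = (len(docs), len(vocab))

# Route 1: accumulate COO triplets, then convert.
# Both the COO arrays and the CSR arrays exist at the same time
# during tocsr(), hence the "fits twice in memory" assumption.
rows, cols, vals = [], [], []
for i, doc in enumerate(docs):
    for tok in doc:
        rows.append(i)
        cols.append(vocab[tok])
        vals.append(1)
X_coo = sp.coo_matrix((vals, (rows, cols)), shape=shape)
X_from_coo = X_coo.tocsr()  # duplicate (row, col) entries are summed

# Route 2: fill the three CSR arrays directly, one row at a time.
indptr, indices, data = [0], [], []
for doc in docs:
    counts = {}
    for tok in doc:
        j = vocab[tok]
        counts[j] = counts.get(j, 0) + 1
    for j in sorted(counts):           # keep column indices sorted per row
        indices.append(j)
        data.append(counts[j])
    indptr.append(len(indices))        # end of this row
X_csr = sp.csr_matrix(
    (np.array(data), np.array(indices), np.array(indptr)), shape=shape
)
```

Route 2 never holds a second copy of the data, but it needs the per-row
bookkeeping (indptr and the per-document counts), which is why it is less
trivial than calling tocsr().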

As a big fan of coordinate descent, I would really like it if scikit-learn
were more CSC-friendly (svmlight loader, preprocessing tools, feature
hasher). The svmlight format is CSR-oriented, but with a two-pass algorithm
or some temporary intermediate data structures, I think it should be
possible to construct a CSC matrix directly.
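
A rough sketch of what the two-pass idea could look like, under some
simplifying assumptions: feature indices are zero-based, there are no
comments or qid fields, n_features is known in advance, and the input can
be iterated twice (e.g. a list of lines or a file that is re-read).
parse_line and load_csc_two_pass are hypothetical helpers, not part of
the existing svmlight loader.

```python
import numpy as np
import scipy.sparse as sp


def parse_line(line):
    """Parse 'label idx:val idx:val ...' into (label, [(idx, val), ...])."""
    parts = line.split()
    feats = []
    for item in parts[1:]:
        idx, val = item.split(":")
        feats.append((int(idx), float(val)))
    return float(parts[0]), feats


def load_csc_two_pass(lines, n_features):
    # Pass 1: count the nonzeros in each column to size the CSC arrays.
    col_counts = np.zeros(n_features, dtype=np.int64)
    n_samples = 0
    for line in lines:
        _, feats = parse_line(line)
        for j, _ in feats:
            col_counts[j] += 1
        n_samples += 1

    # Column pointers are the cumulative sum of the per-column counts.
    indptr = np.zeros(n_features + 1, dtype=np.int64)
    np.cumsum(col_counts, out=indptr[1:])

    # Pass 2: write each entry into the next free slot of its column.
    indices = np.empty(indptr[-1], dtype=np.int64)
    data = np.empty(indptr[-1], dtype=np.float64)
    cursor = indptr[:-1].copy()        # next write position per column
    labels = np.empty(n_samples, dtype=np.float64)
    for i, line in enumerate(lines):
        label, feats = parse_line(line)
        labels[i] = label
        for j, v in feats:
            pos = cursor[j]
            indices[pos] = i           # row index of this entry
            data[pos] = v
            cursor[j] += 1

    X = sp.csc_matrix((data, indices, indptr), shape=(n_samples, n_features))
    return X, labels
```

Because rows are visited in order during the second pass, the row indices
inside each column come out sorted. The price is parsing the input twice
instead of holding a CSR copy and a CSC copy in memory at the same time.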

Mathieu
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
