Hello. I'm looking at feature_extraction.dict_vectorizer and I'm
wondering why fit() and restrict() use a sorted list of feature names
rather than their naturally-encountered order?

Is there an algorithmic requirement somewhere for sorted feature names?

Context: I'm working on a similarity-measurement system (45k columns *
1 million rows, CSR sparse matrix, 600MB). One requirement will be to
occasionally add a column or row. Dynamically updating the mapping,
instead of fully rebuilding the vectorizer, seems like a sensible
idea, but I don't understand why the feature name list is passed
through sorted(). I'm slowly working through the client's requirements
to see whether avoiding a full rebuild is feasible. This is for an
online production system.
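
To make the concern concrete, here is a minimal sketch of why the ordering
matters for incremental updates. AppendOnlyVocabulary is a hypothetical
helper written for illustration only; it is not part of sklearn and does not
reproduce what DictVectorizer does internally. It only contrasts
encounter-order indexing with re-sorting after a new feature arrives.

```python
class AppendOnlyVocabulary:
    """Hypothetical helper: maps feature names to column indices in encounter order."""

    def __init__(self):
        self.index = {}  # feature name -> column index
        self.names = []  # column index -> feature name

    def add(self, name):
        # Existing features keep their column; a new one gets the next free column.
        if name not in self.index:
            self.index[name] = len(self.names)
            self.names.append(name)
        return self.index[name]


vocab = AppendOnlyVocabulary()
for row in [{"zebra": 1.0, "apple": 2.0}, {"mango": 3.0}]:
    for name in row:
        vocab.add(name)

before = dict(vocab.index)
vocab.add("banana")  # a feature that only shows up later

# Encounter order: every earlier column index is untouched.
unchanged = all(vocab.index[k] == v for k, v in before.items())

# Sorted order: re-sorting after the addition moves existing columns.
sorted_before = {n: i for i, n in enumerate(sorted(before))}
sorted_after = {n: i for i, n in enumerate(sorted(vocab.index))}
shifted = [n for n in sorted_before if sorted_before[n] != sorted_after[n]]
```

With encounter order the already-built matrix columns stay valid and the new
feature simply takes the last column. With a sorted list, any feature that
sorts after the new name changes column, so the existing matrix would have to
be permuted or rebuilt.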

Reusing (and probably inheriting from) the sklearn vectorizer would be
nice, rather than rolling a custom solution in numpy. If anyone's
curious, my best approach to resizing the CSR matrix is via
http://stackoverflow.com/questions/6844998/is-there-an-efficient-way-of-concatenating-scipy-sparse-matrices/6853880#6853880
which costs 10 seconds and a temporary +2GB overall.
(If you have a better suggestion for growing a CSR matrix, I'd love to
hear it.)
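
For reference, the kind of growth in question might look like the minimal
sketch below, using scipy.sparse on a toy matrix. It is an assumption-laden
illustration rather than the approach from the linked answer: it relies on
the fact that CSR stores a column index per nonzero, so an empty column can
be added by widening the shape, and it uses sp.vstack to append a row. The
time and memory cost at the 600MB scale are not measured here.

```python
import numpy as np
import scipy.sparse as sp

# Toy stand-in for the real matrix (2 rows x 3 columns).
X = sp.csr_matrix(np.array([[1.0, 0.0, 2.0],
                            [0.0, 3.0, 0.0]]))

# Add an empty column: data, indices and indptr are reused as they are,
# only the declared shape gets one column wider.
X_wider = sp.csr_matrix((X.data, X.indices, X.indptr),
                        shape=(X.shape[0], X.shape[1] + 1))

# Add a row: stack a 1 x n_cols CSR row underneath the existing matrix.
# vstack builds a new matrix, so this step does copy the data.
new_row = sp.csr_matrix(np.array([[0.0, 0.0, 4.0, 5.0]]))
X_grown = sp.vstack([X_wider, new_row], format="csr")
```

Appending columns this way only works if new features take the highest column
index, which is the same reason the sorted feature name list gets in the way.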

Now if anyone's done this sort of thing before and wants to chat about
it, I'd love to say Hi.

Ian.

-- 
Ian Ozsvald (A.I. researcher)
i...@ianozsvald.com

http://IanOzsvald.com
http://ModelInsight.io
http://MorConsulting.com
http://Annotate.IO
http://SocialTiesApp.com
http://TheScreencastingHandbook.com
http://FivePoundApp.com
http://twitter.com/IanOzsvald
http://ShowMeDo.com

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
