Hello. I'm looking at feature_extraction.dict_vectorizer and I'm wondering why fit() and restrict() use a sorted list of feature names rather than the order in which the names are naturally encountered.
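To make the question concrete, here is a minimal pure-Python sketch of the two orderings. It is only an illustration: build_vocabulary is a hypothetical helper I made up for this mail, not the DictVectorizer code, and the sample records are invented.

```python
def build_vocabulary(records, sort):
    """Map each feature name to a column index.

    Hypothetical helper for illustration only; this is not sklearn's code.
    """
    names = []
    seen = set()
    for record in records:
        for name in record:
            if name not in seen:
                seen.add(name)
                names.append(name)  # keep first-encountered order
    if sort:
        names.sort()  # alphabetical order, as with sorted()
    return {name: index for index, name in enumerate(names)}


# Invented sample data: two records, then one more that adds a new feature.
records = [{"pear": 1.0, "apple": 2.0}, {"zebra": 3.0}]
extended = records + [{"banana": 4.0}]

sorted_before = build_vocabulary(records, sort=True)
sorted_after = build_vocabulary(extended, sort=True)
natural_before = build_vocabulary(records, sort=False)
natural_after = build_vocabulary(extended, sort=False)

# Names whose column index changed once "banana" was added.
moved_when_sorted = [n for n in sorted_before if sorted_before[n] != sorted_after[n]]
moved_when_natural = [n for n in natural_before if natural_before[n] != natural_after[n]]
```

With a sorted mapping, a new name can land in the middle and shift the indices of every name after it, so the existing matrix columns no longer line up. With encounter order, the new name is simply appended as the last column.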
Is there an algorithmic requirement somewhere for sorted feature names?

Context: I'm working on a similarity-measurement system (45k columns * 1 million rows, CSR sparse matrix, 600MB). One requirement will be to occasionally add a column or a row. Avoiding a full rebuild of the vectorizer by dynamically updating the mapping seems like a sensible idea, but I don't understand why the feature name list is passed through sorted(). I'm slowly working through the client's requirements to see whether avoiding a full rebuild is feasible. This is for an online production system. Reusing (and probably inheriting from) the sklearn vectorizer would be nice, rather than rolling a custom solution in numpy.

If anyone's curious, my best approach to resizing the CSR array is via http://stackoverflow.com/questions/6844998/is-there-an-efficient-way-of-concatenating-scipy-sparse-matrices/6853880#6853880 which costs 10 seconds and a temporary +2GB overall. If you have a better suggestion for growing a CSR matrix, I'd love to hear it.

If anyone's done this sort of thing before and wants to chat about it, I'd love to say hi.

Ian.

--
Ian Ozsvald (A.I. researcher)
i...@ianozsvald.com
http://IanOzsvald.com