[Scikit-learn-general] CountVectorizer vocabulary

Srevastan Muralidharan Thu, 14 Nov 2013 08:36:27 -0800

Hi

Scikit-learn's CountVectorizer, used for the bag-of-words approach, currently
offers two options: (a) use a custom vocabulary, or (b) if no custom
vocabulary is supplied, build one from all the words present in the corpus.


My question: Can we specify a custom vocabulary to begin with, but ensure that
it gets updated when new words are seen while processing the corpus? I am
assuming this is doable since the matrix is stored in a sparse representation.

Usefulness: This would help when additional documents have to be added to the
training data, so that one does not have to start from the beginning,
especially when the number of documents is large.
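
To make the request concrete, here is a minimal pure-Python sketch of the
behaviour I have in mind. It does not use scikit-learn; the class
GrowingVocabulary, its methods, and the whitespace tokenizer are hypothetical
names made up for illustration. The idea is that each new word gets the next
free column index, so rows that were already counted stay valid and simply
have an implicit zero in the new columns.

```python
from collections import Counter


def tokenize(text):
    # Simplistic whitespace tokenizer (an assumption for this sketch).
    return text.lower().split()


class GrowingVocabulary:
    """Bag-of-words counts whose vocabulary grows as new words appear."""

    def __init__(self, initial_words=()):
        # Map each word to a column index; the custom vocabulary comes first.
        self.index = {}
        for word in initial_words:
            self.index.setdefault(word, len(self.index))
        # One sparse row per document, stored as {column: count}.
        self.rows = []

    def add_documents(self, docs):
        for doc in docs:
            row = {}
            for word, count in Counter(tokenize(doc)).items():
                # An unseen word is appended as a new column, so the
                # indices of existing columns never change.
                col = self.index.setdefault(word, len(self.index))
                row[col] = count
            self.rows.append(row)

    def to_dense(self):
        # Earlier rows are padded with zeros for columns added later.
        width = len(self.index)
        return [[row.get(col, 0) for col in range(width)] for row in self.rows]


bow = GrowingVocabulary(["apple", "banana"])
bow.add_documents(["apple banana apple"])
bow.add_documents(["banana cherry"])  # "cherry" becomes column 2
print(bow.index)
print(bow.to_dense())
```

The first document is not reprocessed when the second batch arrives; only
the width of the matrix changes.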


Sri

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general