Hi,
Scikit-learn's CountVectorizer, used for the bag-of-words approach, currently
offers two options: (a) use a custom vocabulary, or (b) if no custom
vocabulary is supplied, build one from all the words present in the corpus.
My question: can we specify a custom vocabulary to begin with, but have it
updated whenever new words are seen while processing the corpus? I am
assuming this is doable, since the matrix is stored in a sparse representation.
Usefulness: it would help when additional documents have to be added to the
training data, so that one does not have to start from the beginning,
especially when the number of documents is large.
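
To make the idea concrete, here is a rough sketch of a workaround that needs
no changes to CountVectorizer itself. The helper extend_vocabulary is
hypothetical (it is not part of scikit-learn), and the seed words and
documents are made up for illustration. It appends unseen tokens to the end of
the custom vocabulary, so existing words keep their column indices:

```python
from sklearn.feature_extraction.text import CountVectorizer


def extend_vocabulary(seed_vocabulary, documents):
    """Hypothetical helper: seed words first, then unseen tokens in first-seen order."""
    # Reuse CountVectorizer's default tokenization (lowercasing, words of 2+ characters).
    analyzer = CountVectorizer().build_analyzer()
    vocabulary = list(seed_vocabulary)
    known = set(vocabulary)
    for doc in documents:
        for token in analyzer(doc):
            if token not in known:
                known.add(token)
                vocabulary.append(token)
    return vocabulary


# Made-up custom vocabulary and corpus.
seed = ["apple", "banana"]
docs = ["apple banana cherry", "banana date apple"]

vocab = extend_vocabulary(seed, docs)

# A list passed as `vocabulary` fixes the column order to the list order.
vectorizer = CountVectorizer(vocabulary=vocab)
X = vectorizer.transform(docs)  # sparse matrix, one column per vocabulary word

print(vocab)
print(X.toarray())
```

Because new words only ever add columns on the right, a matrix built with the
old vocabulary could presumably be padded with empty columns rather than
recomputed, but this sketch does not do that step.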
Sri
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general