It appears that you aren't using the same vocabulary in the training and testing.
Reading the docs, it seems you can try using the "vocabulary" parameter of CountVectorizer, so that all models share one vocabulary. Each SVM would then have to be trained on an X built with that same vocabulary, since the classifier expects the number of features it saw at training time.

On Thu, Oct 4, 2012 at 1:39 AM, David Montgomery <[email protected]> wrote:
> Hi,
>
> I have this issue. I am using CountVectorizer to create my X for svm
> training. For scoring I pickle the vectorizer and run
> transform_utterance_to_array to score using svm.
>
> This all works. Problem is that I have 10 models and I have 10 pickle
> files that are rather large.
>
> Since there is overlap with tokens across all models, I tried to
> create a single vector for all tokens across all models.
>
> Then I generate the X for each svm model.
>
> When I score using a global model I get this result.
>
> Traceback (most recent call last):
>   ....
>   File "/usr/local/lib/python2.6/dist-packages/sklearn/svm/base.py",
> line 479, in predict_proba
>     X = self._validate_for_predict(X)
>   File "/usr/local/lib/python2.6/dist-packages/sklearn/svm/base.py",
> line 412, in _validate_for_predict
>     (n_features, self.shape_fit_[1]))
> ValueError: X.shape[1] = 304415 should be equal to 74556, the number
> of features at training time
>
> It seems from the error that there is a one-for-one mapping of a
> vector and a classifier and there is no shortcut for dealing with
> multiple vectors.
>
> In essence, when an X is generated from this code:
>
> self.vectorizer = CountVectorizer(tokenizer=self.custom_tokenizer, lowercase=self.lowercase, binary=True)
> self.X = self.vectorizer.fit_transform(self.corpus)
>
> Features in the vector must match the dimensions of X.
>
> I hope this makes sense.
>
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

--
Joseph Turian, Ph.D. | President, MetaOptimize
"Optimize Profits. Optimize Engagement." http://metaoptimize.com
855-ALL-DATA
The web's most active forum for data scientists: http://metaoptimize.com/qa/
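
A minimal sketch of the shared-vocabulary idea, in case it helps. It is not tested against your setup: the two toy corpora and all variable names are made up for illustration, and it uses the default tokenizer rather than your custom one. The only thing it relies on is the documented "vocabulary" parameter of CountVectorizer. The point is that every X, for training and for scoring, comes out with the same number of columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpora standing in for the training data of two models.
corpus_a = ["book a flight to boston", "cancel my flight"]
corpus_b = ["play some jazz music", "stop the music"]

# Fit once on everything to collect one global vocabulary.
global_vectorizer = CountVectorizer(binary=True)
global_vectorizer.fit(corpus_a + corpus_b)
shared_vocabulary = global_vectorizer.vocabulary_  # dict: token -> column index

# Build each model's vectorizer from the same fixed vocabulary.
# With a fixed vocabulary no further fitting is needed before transform().
vec_a = CountVectorizer(vocabulary=shared_vocabulary, binary=True)
vec_b = CountVectorizer(vocabulary=shared_vocabulary, binary=True)

# Training matrices for both models now share one feature space.
X_a = vec_a.transform(corpus_a)
X_b = vec_b.transform(corpus_b)

# A new utterance at scoring time is mapped into the same columns.
X_new = vec_a.transform(["book some music"])

n_features = len(shared_vocabulary)
print(X_a.shape[1], X_b.shape[1], X_new.shape[1], n_features)
```

If each SVM is trained on an X produced this way, only the one vocabulary needs to be pickled instead of ten vectorizers. The trade-off is that every model then carries the full global feature count, including columns that are always zero for that model's data.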
