I'm working through the tutorial, and also experimenting a bit on my own. I'm on the text analysis example, and am curious about the relative merits of analyzing by raw word frequency, relative frequency, and adjusted relative frequency. Using the 20 newsgroups data, I've built a set of pipelines inside a cross-validation loop; the important part of the code is here:
    import datetime as dat
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

    # get the data; seed the shuffle from the current time of day
    nw = dat.datetime.now()
    rndstat = nw.hour*3600 + nw.minute*60 + nw.second
    twenty_train = fetch_20newsgroups(subset='train', categories=categories,
                                      random_state=rndstat, shuffle=True,
                                      download_if_missing=False)
    twenty_test = fetch_20newsgroups(subset='test', categories=categories,
                                     random_state=rndstat, shuffle=True,
                                     download_if_missing=False)

    # first with raw counts
    text_clf = Pipeline([('vect', CountVectorizer()),
                         ('clf', MultinomialNB())])
    text_clf.fit(twenty_train.data, twenty_train.target)
    pred = text_clf.predict(twenty_test.data)
    test_ccrs[mccnt, 0] = sum(pred == twenty_test.target)/len(twenty_test.target)

The issue is that every time I run this, even though I've confirmed that the data sampled is different, the value in test_ccrs is *always* the same. Am I missing something?

Thanks!
Andrew

<~~~~~~~~~~~~~~~~~~~~~~~~~~~>
J. Andrew Howe, PhD
Editor-in-Chief, European Journal of Mathematical Sciences
Executive Editor, European Journal of Pure and Applied Mathematics
www.andrewhowe.com
http://www.linkedin.com/in/ahowe42
https://www.researchgate.net/profile/John_Howe12/
I live to learn, so I can learn to live. - me
<~~~~~~~~~~~~~~~~~~~~~~~~~~~>
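For context, here is a minimal sketch of the three weighting schemes being compared (raw counts, term frequency via TfidfTransformer(use_idf=False), and tf-idf). It uses a tiny made-up corpus in place of the 20 newsgroups fetch so it runs standalone; the corpus, labels, and scheme names are hypothetical, not from the original code.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny stand-in corpus (hypothetical) so the sketch needs no download.
train_docs = ['the rocket launched into orbit',
              'orbit mechanics and rocket fuel',
              'the court ruled on the appeal',
              'appeal filed in federal court']
train_y = [0, 0, 1, 1]
test_docs = ['rocket fuel and orbit', 'the federal appeal']
test_y = [0, 1]

# One pipeline per weighting scheme: raw counts, tf, and tf-idf.
schemes = {
    'counts': Pipeline([('vect', CountVectorizer()),
                        ('clf', MultinomialNB())]),
    'tf':     Pipeline([('vect', CountVectorizer()),
                        ('tfidf', TfidfTransformer(use_idf=False)),
                        ('clf', MultinomialNB())]),
    'tf-idf': Pipeline([('vect', CountVectorizer()),
                        ('tfidf', TfidfTransformer(use_idf=True)),
                        ('clf', MultinomialNB())]),
}

for name, clf in schemes.items():
    clf.fit(train_docs, train_y)
    acc = (clf.predict(test_docs) == test_y).mean()
    print('%s accuracy: %.2f' % (name, acc))
```

On the real data, the accuracies of the three schemes would be compared across cross-validation iterations, as in the loop above.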
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general