I'm working through the tutorial, and also experimenting a bit on my own.
I'm on the text analysis example, and am curious about the relative merits
of analyzing by word frequency, relative frequency, and adjusted relative
frequency.  Using the 20 newsgroups data, I've built a set of pipelines
within a cross-validation loop; the important part of the code (the
raw-count pipeline) is here:

# imports needed to run this snippet
import datetime as dat
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# get the data, seeding the shuffle from the current time so each run differs
# (categories, test_ccrs, and mccnt come from the surrounding loop)
nw = dat.datetime.now()
rndstat = nw.hour*3600 + nw.minute*60 + nw.second
twenty_train = fetch_20newsgroups(subset='train', categories=categories,
                                  random_state=rndstat, shuffle=True,
                                  download_if_missing=False)
twenty_test = fetch_20newsgroups(subset='test', categories=categories,
                                 random_state=rndstat, shuffle=True,
                                 download_if_missing=False)

# first pipeline: raw word counts into a multinomial naive Bayes classifier
text_clf = Pipeline([('vect', CountVectorizer()), ('clf', MultinomialNB())])
text_clf.fit(twenty_train.data, twenty_train.target)
pred = text_clf.predict(twenty_test.data)
test_ccrs[mccnt, 0] = sum(pred == twenty_test.target) / len(twenty_test.target)
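
The other two pipelines are set up the same way; roughly, they look like
the sketch below (I'm assuming TfidfTransformer here, with use_idf=False
for plain relative frequencies and use_idf=True for the idf-adjusted
version, and the results going into the next two columns of test_ccrs):

from sklearn.feature_extraction.text import TfidfTransformer

# relative frequency: per-document normalized counts, no idf adjustment
text_clf_tf = Pipeline([('vect', CountVectorizer()),
                        ('tf', TfidfTransformer(use_idf=False)),
                        ('clf', MultinomialNB())])
text_clf_tf.fit(twenty_train.data, twenty_train.target)
pred = text_clf_tf.predict(twenty_test.data)
test_ccrs[mccnt, 1] = sum(pred == twenty_test.target) / len(twenty_test.target)

# adjusted relative frequency: tf-idf weighting
text_clf_tfidf = Pipeline([('vect', CountVectorizer()),
                           ('tfidf', TfidfTransformer(use_idf=True)),
                           ('clf', MultinomialNB())])
text_clf_tfidf.fit(twenty_train.data, twenty_train.target)
pred = text_clf_tfidf.predict(twenty_test.data)
test_ccrs[mccnt, 2] = sum(pred == twenty_test.target) / len(twenty_test.target)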

The issue is that every time I run this, even though I've confirmed that the
data sampled is different, the value in test_ccrs is *always* the same.  Am I
missing something?

Thanks!
Andrew

<~~~~~~~~~~~~~~~~~~~~~~~~~~~>
J. Andrew Howe, PhD
Editor-in-Chief, European Journal of Mathematical Sciences
Executive Editor, European Journal of Pure and Applied Mathematics
www.andrewhowe.com
http://www.linkedin.com/in/ahowe42
https://www.researchgate.net/profile/John_Howe12/
I live to learn, so I can learn to live. - me
<~~~~~~~~~~~~~~~~~~~~~~~~~~~>