Hi Nigel,
What is the proportion of English versus non-English tweets in your data?
It may be the case that your dataset is unbalanced.
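One quick way to check is to count the labels in the training file. Below is a minimal sketch, assuming each line has the form `text, label` with the label (`yes` or `no`) after the last comma, as your parsing code suggests. The helper name `label_counts` and the sample lines are made up for illustration.

```python
from collections import Counter


def label_counts(lines):
    """Count the labels in lines of the assumed form 'text, label'."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Split on the LAST comma, since the tweet text may itself contain commas.
        _text, sep, label = line.rpartition(",")
        if not sep:
            continue  # no comma on this line: skip it rather than guess a label
        counts[label.strip()] += 1
    return counts


# Hypothetical sample standing in for the real training file.
sample = [
    "hello there, yes\n",
    "bonjour, tout le monde, no\n",
    "good morning, yes\n",
]
counts = label_counts(sample)
total = sum(counts.values())
for label, n in counts.most_common():
    print(f"{label}: {n} ({n / total:.0%})")
```

On the real data you could pass the open file object instead of `sample`. If one class turns out to dominate heavily, one option to try is passing `class_weight='balanced'` to `svm.SVC` (spelled `'auto'` in older scikit-learn releases).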
Gilles
On 18 October 2013 09:32, Nigel Legg <nigel.l...@gmail.com> wrote:
> I have a set of tweets, and I am trying to use an SVM classifier to class
> them as being English or another language. I have a training set which has
> been classified by hand.
>
> My code:
> train_set = open(train_set)
> corpus = []
> i = 0
> for line in train_set:
>     x = line.find(',')
>     text = line[:x]
>     corpus.append(text)
>     i += 1
>
> X = vectorizer.fit_transform(corpus)
> y = np.zeros(i)
> train_set.seek(0)
> i = 0
> for line in train_set:
>     x = line.find(',')
>     cat = line[x+1:]
>     cat = cat[1:-1]
>     if cat == 'yes': y[i] = 0
>     if cat == 'no': y[i] = 1
>     i += 1
> print(str(y))
>
> from sklearn import svm
>
> clf = svm.SVC()
> clf.fit(X,y)
> xtest = input('Test File::.... ')
> test_set = xpath+xtest
> out = open(xpath+'class.csv', 'w')
> i = 0
> for line in open(test_set):
>     x = vectorizer.transform(line).toarray()
>     r = clf.predict(x)
>     if r.all() == 0: x_cat = 'yes'
>     if r.any() == 1: x_cat = 'no'
>     print str(i), line, x_cat, str(r)
>     out.write(line[:-1]+', '+x_cat+'\n')
>     i += 1
>
> The problem is that r is always all zeros, so all the tweets get classified
> as yes, or English.
> What am I doing wrong here?
>
> Cheers, Nigel
> 07914 740972
>
>
>
> ------------------------------------------------------------------------------
> October Webinars: Code for Performance
> Free Intel webinars can help you accelerate application performance.
> Explore tips for MPI, OpenMP, advanced profiling, and more. Get the most
> from
> the latest Intel processors and coprocessors. See abstracts and register >
> http://pubads.g.doubleclick.net/gampad/clk?id=60135031&iu=/4140/ostg.clktrk
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>