Hi Nigel,

What is the proportion of English versus non-English tweets in your training
data? If the dataset is heavily unbalanced, the SVM may end up predicting the
majority class for every tweet.
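
A quick way to check the balance is to count the labels before fitting. The snippet below is only a rough sketch under stated assumptions: the toy corpus, the helper name label_proportions, the character n-gram TfidfVectorizer and the use of LinearSVC with class_weight="balanced" are illustrative choices, not taken from the code in the original message.

```python
from collections import Counter

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC


def label_proportions(labels):
    """Return {label: fraction of samples} for a sequence of labels.

    Hypothetical helper, introduced here only for illustration.
    """
    counts = Counter(labels)
    total = float(sum(counts.values()))
    return {label: n / total for label, n in counts.items()}


# Toy data, made up for illustration: 0 = English ('yes'), 1 = other ('no').
corpus = [
    "the weather is lovely today",
    "i am going to the shop",
    "what time is the match",
    "see you at the station",
    "il fait beau aujourd'hui",
    "je vais au magasin",
]
y = np.array([0, 0, 0, 0, 1, 1])

# Inspect the class balance before training.
proportions = label_proportions(y.tolist())
print(proportions)

# Character n-grams are an assumed feature choice for language identification.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3))
X = vectorizer.fit_transform(corpus)

# class_weight="balanced" reweights classes inversely to their frequency.
clf = LinearSVC(class_weight="balanced")
clf.fit(X, y)

# transform() expects an iterable of documents, so a single tweet
# is wrapped in a one-element list.
pred = clf.predict(vectorizer.transform(["je suis au magasin"]))
print(pred)
```

If one class makes up nearly all of the training labels, reweighting the classes or collecting more examples of the minority class is usually worth trying before tuning anything else.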

Gilles


On 18 October 2013 09:32, Nigel Legg <nigel.l...@gmail.com> wrote:

> I have a set of tweets, and I am trying to use an SVM classifier to classify
> them as being English or another language.  I have a training set which has
> been classified by hand.
>
> My code:
> train_set = open(train_set)
> corpus = []
> i = 0
> for line in train_set:
>     x = line.find(',')
>     text = line[:x]
>     corpus.append(text)
>     i += 1
>
> X = vectorizer.fit_transform(corpus)
> y = np.zeros(i)
> train_set.seek(0)
> i = 0
> for line in train_set:
>     x = line.find(',')
>     cat = line[x+1:]
>     cat = cat[1:-1]
>     if cat == 'yes': y[i]=0
>     if cat == 'no': y[i]=1
>     i+=1
> print(str(y))
>
> from sklearn import svm
>
> clf = svm.SVC()
> clf.fit(X,y)
> xtest = input('Test File::....  ')
> test_set = xpath+xtest
> out = open(xpath+'class.csv', 'w')
> i = 0
> for line in open(test_set):
>     x = vectorizer.transform(line).toarray()
>     r = clf.predict(x)
>     if r.all() == 0: x_cat = 'yes'
>     if r.any() == 1: x_cat = 'no'
>     print str(i), line, x_cat, str(r)
>     out.write(line[:-1]+', '+x_cat+'\n')
>     i +=1
>
> The problem is that r is always all zeros, so all the tweets get classified
> as yes, i.e. English.
> What am I doing wrong here?
>
> Cheers, Nigel
> 07914 740972
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
