I have a set of tweets, and I am trying to use an SVM classifier to classify
them as English or another language.  I have a training set that has
been labelled by hand.

My code:
train_set = open(train_set)
corpus = []
i = 0
for line in train_set:
    x = line.find(',')
    text = line[:x]
    corpus.append(text)
    i += 1

X = vectorizer.fit_transform(corpus)
y = np.zeros(i)
train_set.seek(0)
i = 0
for line in train_set:
    x = line.find(',')
    cat = line[x+1:]
    cat = cat[1:-1]
    if cat == 'yes': y[i]=0
    if cat == 'no': y[i]=1
    i+=1
print(str(y))

from sklearn import svm

clf = svm.SVC()
clf.fit(X,y)
xtest = input('Test File::....  ')
test_set = xpath+xtest
out = open(xpath+'class.csv', 'w')
i = 0
for line in open(test_set):
    x = vectorizer.transform(line).toarray()
    r = clf.predict(x)
    if r.all() == 0: x_cat = 'yes'
    if r.any() == 1: x_cat = 'no'
    print(i, line, x_cat, r)
    out.write(line[:-1]+', '+x_cat+'\n')
    i +=1

The problem is that r is always all zeros, so every tweet gets classified
as 'yes', i.e. English.
What am I doing wrong here?
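
For comparison, here is a minimal sketch of the same pipeline with a few
things changed that may be contributing. These are guesses, not a confirmed
diagnosis: line.find(',') splits on the first comma, so a tweet that itself
contains a comma never yields a clean 'yes'/'no' label and y keeps its
default 0; vectorizer.transform() expects a list of documents rather than a
bare string; and a linear SVM is often a more forgiving starting point for
sparse text features than the default SVC kernel. The names parse_line,
classify and train_lines are hypothetical, the training data is a made-up
stand-in for the real file, and the character n-gram TfidfVectorizer is an
assumption, since the original vectorizer is not shown.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC


def parse_line(line):
    """Split 'tweet text, yes' on the LAST comma so commas in the tweet survive."""
    text, _, cat = line.rstrip("\n").rpartition(",")
    return text, cat.strip()


# Hypothetical stand-in for the hand-labelled training file.
train_lines = [
    "the weather is lovely today, is it not?, yes\n",
    "I am going to the shop, yes\n",
    "what are you doing this evening, yes\n",
    "il fait très beau aujourd'hui, no\n",
    "je vais au magasin, no\n",
    "qu'est-ce que tu fais ce soir, no\n",
]

corpus, labels = [], []
for line in train_lines:
    text, cat = parse_line(line)
    if cat not in ("yes", "no"):
        # Fail loudly instead of silently leaving the label at 0.
        raise ValueError("unexpected label %r" % cat)
    corpus.append(text)
    labels.append(0 if cat == "yes" else 1)  # same encoding as the original
y = np.array(labels)

# Assumption: character n-grams, which tend to suit short texts.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3))
X = vectorizer.fit_transform(corpus)

clf = LinearSVC()
clf.fit(X, y)


def classify(line):
    """Return 'yes' (English) or 'no' for a single tweet."""
    # Note the list: transform() takes an iterable of documents.
    x = vectorizer.transform([line.strip()])
    return "yes" if clf.predict(x)[0] == 0 else "no"
```

Printing np.bincount(y) after building the labels is a quick way to confirm
that both classes actually made it into the training data before fitting.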

Cheers, Nigel
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general