I’m having some conceptual trouble with a supervised machine learning
project (regression) that I hope someone can help me with.

I am trying to do sentiment analysis on texts (scoring them from -10 to 
+10) based on a human-scored training set.

Training set:
Cases = 35
Score Mean = 0.77
Score STD = 8.07

Testing set:
Cases = 12
Score Mean = -2.08
Score STD = 7.43

Features:
Number: 8
They are:
Two scores based on word frequency.
These correlate highly with the real scores.
The rest are features of the text, such as ‘punctuation density’.
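
To give an idea of what a text feature like this can look like, here is a minimal sketch. The definition used (share of non-whitespace characters that are punctuation) and the function name `punctuation_density` are assumptions for illustration only, not necessarily the exact feature used in the project.

```python
import string


def punctuation_density(text: str) -> float:
    """Fraction of non-whitespace characters that are ASCII punctuation.

    Assumed definition, for illustration only.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        # Empty or whitespace-only text has no punctuation to measure.
        return 0.0
    return sum(c in string.punctuation for c in chars) / len(chars)
```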

Evaluation:
I calculate the prediction accuracy by finding the mean error
between the prediction (machine score) and target (real
human score).
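
The calculation above can be sketched roughly as follows. This assumes "mean error" means the mean of the absolute differences (signed differences would partly cancel out) and that the standard deviation is the population one (NumPy's default); `error_summary` is a hypothetical helper name.

```python
import numpy as np


def error_summary(y_true, y_pred):
    """Return (mean, std) of the absolute prediction errors.

    Assumption: "error" is |prediction - target| per case.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = np.abs(y_pred - y_true)
    return errors.mean(), errors.std()
```

On a score scale of -10 to +10, an absolute error of this kind cannot exceed 20 as long as the predictions stay inside the scale, so much larger values mean the predictions are falling far outside it.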

Methods, in order of success:

Linear Regression (OLS):
  Code: linear_model.LinearRegression()
  Result:
    Training Set Mean Error: 7.51
    Training Set STDV of Error: 6.58
    Testing Set Mean Error: 90.29
    Testing Set STDV of Error: 11.26

Support Vector Regression (SVR), Linear:
  Code: SVR(kernel="linear")
  Result:
    Training Set Mean Error: 8.17
    Training Set STDV of Error: 8.55
    Testing Set Mean Error: 89.93
    Testing Set STDV of Error: 11.12

Ridge Regression:
  Code: linear_model.Ridge()
  Result:
    Training Set Mean Error: 8.39
    Training Set STDV of Error: 7.46
    Testing Set Mean Error: 90.65
    Testing Set STDV of Error: 11.13

Support Vector Regression (SVR), 2nd degree polynomial:
  Code: SVR(kernel="poly", degree=2)
  Result:
    Training Set Mean Error: 9.16
    Training Set STDV of Error: 7.35
    Testing Set Mean Error: 107.31
    Testing Set STDV of Error: 35.19
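
For anyone who wants to reproduce the setup, the comparison above has roughly the following shape. This is only a sketch: the data here is synthetic (random features with the same 35/12 split and 8 columns), so it will not reproduce the numbers listed, and the absolute-error metric is an assumption about what "mean error" means.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.svm import SVR

# Synthetic stand-in data (NOT the real texts): 47 cases with 8 features,
# where the first two features track the score, clipped to [-10, 10].
rng = np.random.default_rng(0)
X = rng.normal(size=(47, 8))
noise = rng.normal(scale=2.0, size=47)
y = np.clip(4.0 * X[:, 0] + 4.0 * X[:, 1] + noise, -10, 10)

# Same split sizes as in the post: 35 training cases, 12 testing cases.
X_train, y_train = X[:35], y[:35]
X_test, y_test = X[35:], y[35:]

# The four estimators listed above, with default settings otherwise.
models = {
    "OLS": LinearRegression(),
    "SVR linear": SVR(kernel="linear"),
    "Ridge": Ridge(),
    "SVR poly 2": SVR(kernel="poly", degree=2),
}


def abs_error_stats(model, X, y):
    """Mean and standard deviation of |prediction - target| (assumed metric)."""
    errors = np.abs(model.predict(X) - y)
    return errors.mean(), errors.std()


results = {}
for name, model in models.items():
    model.fit(X_train, y_train)  # fit on the training cases only
    results[name] = {
        "train": abs_error_stats(model, X_train, y_train),
        "test": abs_error_stats(model, X_test, y_test),
    }

for name, stats in results.items():
    print(name, "train (mean, std):", stats["train"],
          "test (mean, std):", stats["test"])
```

The important detail is that the test cases go through exactly the same feature pipeline as the training cases and are only passed to `predict`, never to `fit`.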

But as you can see, the test set predictions are absolutely terrible, no
matter what I do.
The training set predictions are quite accurate, though. From my reading,
this could be due to overfitting. However, I don’t see how a simple
linear model (OLS) could overfit anything. On top of that, the features
I’m working with lend themselves to prediction: the features based on
word frequencies in particular correlate highly with the real human
scores. I’ve even tried neural networks in SPSS using the default
settings, and the training set prediction works well. But I can’t get
anything to work well here in scikit-learn.

So what’s the problem?

Thanks so much,

Zach

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
