I’m having some conceptual trouble with this supervised machine learning project (regression) that hopefully someone can help me with.
I am trying to do sentiment analysis on texts (scoring them from -10 to +10) based on a human-scored training set.

Training set:
  Cases = 35
  Score Mean = 0.77
  Score STD = 8.07

Testing set:
  Cases = 12
  Score Mean = -2.08
  Score STD = 7.43

Features:
  Number: 8
  They are:
    Two scores based on word frequency. These correlate highly with the real scores.
    The rest are features of the text such as ‘punctuation density’.

Evaluation:
I calculate the prediction accuracy by finding the mean error between the prediction (machine score) and target (real human score).

Methods, in order of success:

Linear Regression (OLS):
  Code: linear_model.LinearRegression()
  Result:
    Training Set Mean Error: 7.51
    Training Set STDV of Error: 6.58
    Testing Set Mean Error: 90.29
    Testing Set STDV of Error: 11.26

Support Vector Regression (SVR), Linear:
  Code: SVR(kernel="linear")
  Result:
    Training Set Mean Error: 8.17
    Training Set STDV of Error: 8.55
    Testing Set Mean Error: 89.93
    Testing Set STDV of Error: 11.12

Ridge Regression:
  Code: linear_model.Ridge()
  Result:
    Training Set Mean Error: 8.39
    Training Set STDV of Error: 7.46
    Testing Set Mean Error: 90.65
    Testing Set STDV of Error: 11.13

Support Vector Regression (SVR), 2nd degree polynomial:
  Code: SVR(kernel="poly", degree=2)
  Result:
    Training Set Mean Error: 9.16
    Training Set STDV of Error: 7.35
    Testing Set Mean Error: 107.31
    Testing Set STDV of Error: 35.19

But as you can see, the test set predictions are absolutely terrible, no matter what I do. The training set predictions are quite accurate though. From my reading, this could be due to overfitting. However, I don’t see how a simple linear model (OLS) could overfit anything…

On top of that, the features I’m working with lend themselves to prediction (the features based on word frequencies in particular correlate highly with the real human scores). I’ve even tried Neural Networks in SPSS using the default settings, and the training set prediction works well. But I can’t get anything to work well here in SciKit-Learn.
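For concreteness, the evaluation described above could be sketched roughly as follows. This is only an illustration under assumptions: it takes "mean error" to mean the mean absolute difference between machine and human scores (a signed mean would be another reading), it uses the population standard deviation for the "STDV of Error", and the helper name `error_summary` and the sample scores are made up for the example.

```python
from statistics import mean, pstdev


def error_summary(predictions, targets):
    """Return (mean absolute error, population std of the absolute errors).

    Hypothetical helper: the absolute-error reading of "mean error" is an
    assumption, not something confirmed by the numbers reported above.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    # One absolute error per text: |machine score - human score|.
    abs_errors = [abs(float(p) - float(t)) for p, t in zip(predictions, targets)]
    return mean(abs_errors), pstdev(abs_errors)


# Made-up scores on the -10..+10 scale, for illustration only.
human_scores = [5.0, -3.0, 8.0, -7.0]
machine_scores = [4.0, -1.0, 9.0, -3.0]

mae, spread = error_summary(machine_scores, human_scores)
print(f"Mean Error: {mae:.2f}, STDV of Error: {spread:.2f}")
```

With a fitted scikit-learn estimator, the same helper could be called on the output of `predict` for the training and testing features in turn (whatever those arrays are named in the actual script), which makes it easy to compare the two sets with exactly the same metric.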
So what’s the problem?

Thanks so much,
Zach
