Hi Zach,

If you provide a gist with your evaluation setup (similar to this one
[1]), I can look into it.
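
Something along these lines would be enough to reproduce the problem. This is
only a minimal sketch: the data below is synthetic stand-in data (an
assumption; only the shapes, 35/12 cases and 8 features, come from your
description), and I am assuming that "mean error" means the mean absolute
error. The helper name abs_error_stats is made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.svm import SVR

# Synthetic stand-in data (assumption): replace with the real features/scores.
rng = np.random.RandomState(0)
X_train = rng.normal(size=(35, 8))
X_test = rng.normal(size=(12, 8))
w = rng.normal(size=8)
y_train = X_train.dot(w) + rng.normal(scale=0.5, size=35)
y_test = X_test.dot(w) + rng.normal(scale=0.5, size=12)


def abs_error_stats(model, X, y):
    """Return (mean, std) of the absolute prediction error (hypothetical helper)."""
    err = np.abs(model.predict(X) - y)
    return err.mean(), err.std()


# The four estimators from the original message, with default settings.
models = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(),
    "SVR linear": SVR(kernel="linear"),
    "SVR poly-2": SVR(kernel="poly", degree=2),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)  # fit on the training set only
    results[name] = (
        abs_error_stats(model, X_train, y_train),
        abs_error_stats(model, X_test, y_test),
    )
    (tr_mean, tr_std), (te_mean, te_std) = results[name]
    print("%s: train %.2f +/- %.2f, test %.2f +/- %.2f"
          % (name, tr_mean, tr_std, te_mean, te_std))
```

With the real arrays in place of the synthetic ones, the printed numbers
should match the ones in your mail; if they do not, the difference is probably
in how the data is loaded or how the error is computed.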

best,
 Peter

[1] https://gist.github.com/3266657

2012/8/9 Zach Bastick <[email protected]>:
> I’m having some conceptual trouble with this supervised machine learning
> project (regression) that hopefully someone can help me with.
>
> I am trying to do sentiment analysis on texts (scoring them from -10 to
> +10) based on a human-scored training set.
>
> Training set:
> Cases = 35
> Score Mean = 0.77
> Score STD = 8.07
>
> Testing set:
> Cases = 12
> Score Mean = -2.08
> Score STD = 7.43
>
> Features:
> Number: 8
> They are:
>   - Two scores based on word frequency. These correlate highly with
>     the real scores.
>   - The rest are features of the text, such as ‘punctuation density’.
>
> Evaluation:
> I calculate the prediction accuracy by finding the mean error
> between the prediction (machine score) and target (real
> human score).
>
> Methods, in order of success:
>   Linear Regression (OLS):
>     Code: linear_model.LinearRegression()
>     Result:
>       Training Set Mean Error: 7.51
>       Training Set STDV of Error: 6.58
>       Testing Set Mean Error: 90.29
>       Testing Set STDV of Error: 11.26
>   Support Vector Regression (SVR), Linear:
>     Code: SVR(kernel="linear")
>     Result:
>       Training Set Mean Error: 8.17
>       Training Set STDV of Error: 8.55
>       Testing Set Mean Error: 89.93
>       Testing Set STDV of Error: 11.12
>   Ridge Regression:
>     Code: linear_model.Ridge()
>     Result:
>       Training Set Mean Error: 8.39
>       Training Set STDV of Error: 7.46
>       Testing Set Mean Error: 90.65
>       Testing Set STDV of Error: 11.13
>   Support Vector Regression (SVR), 2nd degree polynomial:
>     Code: SVR(kernel="poly", degree=2)
>     Result:
>       Training Set Mean Error: 9.16
>       Training Set STDV of Error: 7.35
>       Testing Set Mean Error: 107.31
>       Testing Set STDV of Error: 35.19
>
> But as you can see, the test set predictions are absolutely terrible, no
> matter what I do.
> The training set predictions are quite accurate though. From my reading,
> this could be due to overfitting. However, I don’t see how a simple
> linear model (OLS) could overfit anything… On top of that, the features
> I’m working with lend themselves to prediction (the features based on word
> frequencies in particular correlate highly with the real human scores).
> I’ve even tried Neural Networks in SPSS using the default settings, and
> the training set prediction works well. But I can’t get anything to work
> well here in scikit-learn.
>
> So what’s the problem?
>
> Thanks so much,
>
> Zach
>
> ------------------------------------------------------------------------------
> Live Security Virtual Conference
> Exclusive live event will cover all the ways today's security and
> threat landscape has changed and how IT managers can respond. Discussions
> will include endpoint security, mobile security and the latest in malware
> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general



-- 
Peter Prettenhofer

