Zach,

can you check the construction of your test set? The CV scores on the
training set are OK, but when I apply the model to the test set it
fails miserably. Comparing the feature distributions of the training
and test sets suggests that something is fishy, but it's hard to tell
given the small test set size. How did you generate the test set
(maybe you got bitten by sample selection bias)? Did you apply any
transformation to the two sets? Does the SPSS script you were
referring to use _exactly_ the same csv files?
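
A quick way to eyeball the mismatch, assuming the raw features sit in
numpy arrays X_train and X_test with the same column order
(hypothetical names; adjust to however you load the csv files), is a
per-feature two-sample Kolmogorov-Smirnov test:

    from scipy.stats import ks_2samp

    # compare each feature's empirical distribution across the two
    # sets; a large KS statistic flags a suspicious feature
    for j in range(X_train.shape[1]):
        stat, pval = ks_2samp(X_train[:, j], X_test[:, j])
        print("feature %d: KS=%.3f  p=%.3f" % (j, stat, pval))

With a test set this small the p-values won't be conclusive, but the
features with the largest KS statistics are where I'd start looking.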

best,
 Peter

2012/8/10 Zach Bastick <[email protected]>:
> Hi David
>
> I did try ridge regression, as per my original message, but it didn't
> improve the results. Maybe I'm implementing it incorrectly.
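>
> Is something like the following the right basic pattern? (A minimal
> sketch; the alpha is an arbitrary guess, and X, y stand in for my
> actual arrays.)
>
>     from sklearn.linear_model import Ridge
>     from sklearn.cross_validation import cross_val_score
>
>     model = Ridge(alpha=1.0)  # alpha almost certainly needs tuning
>     scores = cross_val_score(model, X, y, cv=5)  # per-fold R^2
>     print(scores.mean())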
>
> Generally, I think the data set should be fine. I've correlated each
> feature with the dependent variable and each one shows a high
> correlation on its own. I'm not sure why the predictions are so far
> off.
>
> Zach
>
>
> On 09/08/2012 18:20, David Warde-Farley wrote:
>> On Thu, Aug 9, 2012 at 2:08 PM, Zach Bastick <[email protected]> wrote:
>>
>>> But as you can see, the predictions are absolutely terrible, no matter
>>> what I do.
>>> The training set predictions are quite accurate, though. From my
>>> reading, this could be due to overfitting. However, I don't see how a
>>> simple linear model (OLS) could overfit anything…
>> Quite easily. Overfitting is not really a property of the model family
>> so much as of how robustly you can estimate its parameters from the
>> amount of training data you have. Remember that the coefficients you
>> are fitting specify a hyperplane in high-dimensional space (where your
>> geometric intuitions
>> will often fail you), and unless your problem is truly linear and also
>> noiseless, the OLS coefficients estimated on a finite training set are but
>> an approximation of the "best" hyperplane for your problem (in terms of
>> minimizing generalization error).
>>
>> With 8 features and only 35 cases, you're estimating 9 parameters (8
>> coefficients plus an intercept) from very little data, and you will
>> most likely need heavy
>> regularization to avoid the model overspecializing to the quirks of
>> the training set. I would try out ridge regression as a first pass,
>> and maybe the Lasso.
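>>
>> To make that concrete, here is a sketch on synthetic data of the same
>> shape as yours (35 cases, 8 features; the data, names and noise level
>> are made up, not taken from your csv):
>>
>>     import numpy as np
>>     from sklearn.linear_model import LinearRegression, RidgeCV
>>     from sklearn.cross_validation import cross_val_score
>>
>>     rng = np.random.RandomState(0)
>>     X = rng.randn(35, 8)                         # 35 cases, 8 features
>>     y = X.dot(rng.randn(8)) + 3 * rng.randn(35)  # linear signal + noise
>>
>>     # the training R^2 looks great...
>>     print(LinearRegression().fit(X, y).score(X, y))
>>     # ...but the cross-validated R^2 is usually far lower:
>>     print(cross_val_score(LinearRegression(), X, y, cv=5).mean())
>>     # ridge with a CV-chosen alpha typically closes part of the gap:
>>     ridge = RidgeCV(alphas=np.logspace(-3, 3, 13))
>>     print(cross_val_score(ridge, X, y, cv=5).mean())
>>
>> The difference between the training score and the cross-validated
>> score is the overfitting you're seeing.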
>>
>> Cheers,
>>
>> David



-- 
Peter Prettenhofer
