I have tried various machine learning algorithms from scikit-learn but can't find a good prediction model. The features I'm using are the tf-idf values of a set of text documents, and I am trying to predict the human rating assigned to each document. I'm thinking that I must be doing something wrong, as the scores can't be that bad (not to mention negative?).
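A note on the negative values: if the scores come from the default `score` method of scikit-learn regressors (an assumption, since regression.py itself is not reproduced in this message), they are coefficients of determination (R^2) rather than accuracies. R^2 is 1 minus the ratio of the residual sum of squares to the total sum of squares, so it drops below zero whenever a model predicts worse than a constant equal to the mean rating. The sketch below shows this in plain Python; the function name `r2` and the ratings are made up for illustration and are not taken from the dataset.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical ratings, for illustration only (not from the /texts data).
ratings = [1.0, 2.0, 3.0, 4.0, 5.0]

# Baseline that always predicts the mean rating.
always_mean = [3.0] * len(ratings)

# A deliberately bad model that predicts the ratings in reverse order.
reversed_pred = list(reversed(ratings))

r2_perfect = r2(ratings, ratings)        # perfect predictions
r2_mean = r2(ratings, always_mean)       # no better than the mean
r2_bad = r2(ratings, reversed_pred)      # worse than the mean, so negative

print(r2_perfect, r2_mean, r2_bad)
```

Under that assumption, a score around -1 would mean the squared error is roughly twice that of always predicting the mean rating.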
If someone could have a look at it, I'd really appreciate it. I didn't upload it as a GitHub gist because gists won't let me upload the dataset directory, so I've uploaded my really short code (regression.py) and the original data set (/texts) here (625K):

https://dl.dropbox.com/u/74279156/regression.zip

This is my output:

    C:\python code\program>python regression.py
    loading texts...
    n_samples: 53, n_features: 6284
    LinearRegresson
    [ 0.34662496 0.23446674 0.30332109 0.3163838 0.01607913]
    Accuracy: 0.24 (+/- 0.06)
    SVR linear
    [-0.05521329 -1.61280714 -0.67428098 -0.8805647 -2.20730703]
    Accuracy: -1.09 (+/- 0.37)
    SVR poly 4 degrees
    [-0.18814233 -1.78480475 -0.88158686 -1.05944432 -2.40284073]
    Accuracy: -1.26 (+/- 0.38)
    SVR sigmoid
    [-0.18814233 -1.78480475 -0.88158686 -1.05944432 -2.40284073]
    Accuracy: -1.26 (+/- 0.38)

Please tell me what's wrong. I'm dying to know how to get scikit-learn to predict based on this dataset.

Thanks,
Zach
