2013/4/24 Alex Kopp <ark...@cornell.edu>:
> Thanks, guys.
>
> Perhaps I should explain what I am trying to do and then open it up for
> suggestions.
>
> I have 203k training examples each with 457k features. The features are
> composed of one-hot encoded categorical values as well as stemmed, TFIDF
> weighted unigrams and bigrams (NLP). As you can probably guess, the
> overwhelming majority of the features are the unigrams and bigrams.
>
> In the end, I am looking to build a regression model. I have tried a grid
> search on SGDRegressor, but have not had any promising results (~0.00 or
> even negative R^2 values).
>
> I would appreciate ideas/suggestions.

Have you tried plotting a histogram of the target variable? If it is
highly non-Gaussian (e.g. positive with a long tail), predicting its log
or square root might be easier.

Also, have you tried a simpler problem such as binary classification?

1- Split your training samples into 3 equal subsets:

   A: the 1/3 of the samples with the biggest outputs,
   B: the 1/3 of the samples with the smallest outputs,
   C: the remaining 1/3 of the samples in the middle.

2- Discard C and train a binary classifier (e.g. a grid-searched
   SGDClassifier, treating A samples as positive and B samples as
   negative).

If you cannot get past 55% cross-validated accuracy on this problem, it
probably means that your problem is really hard: either the output
variable is unrelated to the input or the dependency is highly
non-linear.

You can also try dimensionality reduction by running MiniBatchKMeans on
the whole dataset with 1000 centroids. Then compute the cosine similarity
of your samples with those 1000 centroids, threshold at zero to get
non-negative values, and treat those 1000 dimensions as new features for
your samples. Then train a random forest on the new features.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
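
For reference, the two experiments suggested above (the A-versus-B
classification check and the centroid-similarity features) could be
sketched roughly as follows. This is only an illustration under some
assumptions: a recent scikit-learn where cross_val_score lives in
sklearn.model_selection, default SGDClassifier settings instead of a
grid search, and hypothetical helper names (extreme_thirds_accuracy,
centroid_similarity_features) and synthetic data that are not from the
original thread.

```python
import numpy as np
from scipy import sparse
from sklearn.cluster import MiniBatchKMeans
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import SGDClassifier
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import cross_val_score


def extreme_thirds_accuracy(X, y, cv=5, random_state=0):
    """Cross-validated accuracy for top-third (A) vs bottom-third (B) targets.

    Hypothetical helper: the middle third (C) is discarded, A is labelled 1
    and B is labelled 0. No grid search is done here; add one as needed.
    """
    y = np.asarray(y)
    low, high = np.percentile(y, [100 / 3, 200 / 3])
    idx = np.flatnonzero((y <= low) | (y >= high))  # keep A and B, drop C
    labels = (y[idx] >= high).astype(int)
    clf = SGDClassifier(random_state=random_state)
    scores = cross_val_score(clf, X[idx], labels, cv=cv, scoring="accuracy")
    return scores.mean()


def centroid_similarity_features(X, n_clusters=1000, random_state=0):
    """Cosine similarity of each sample to k-means centroids, clipped at zero.

    Hypothetical helper: returns a dense (n_samples, n_clusters) array.
    """
    km = MiniBatchKMeans(n_clusters=n_clusters, n_init=3,
                         random_state=random_state)
    km.fit(X)
    sims = cosine_similarity(X, km.cluster_centers_)
    return np.maximum(sims, 0.0)


if __name__ == "__main__":
    # Tiny synthetic stand-in for the real data (for illustration only).
    rng = np.random.RandomState(0)
    X = sparse.csr_matrix(rng.rand(300, 20))
    y = np.exp(X @ rng.randn(20))  # positive target with a long tail
    y_log = np.log(y)              # log-transformed regression target

    print("A-vs-B accuracy:", extreme_thirds_accuracy(X, y))

    feats = centroid_similarity_features(X, n_clusters=10)
    forest = RandomForestRegressor(n_estimators=50, random_state=0)
    forest.fit(feats, y_log)
    print("new feature matrix shape:", feats.shape)
```

On the real data you would pass the sparse TF-IDF matrix directly and
use n_clusters=1000 as suggested; the small values here only keep the
demo fast.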