2013/4/24 Alex Kopp <ark...@cornell.edu>:
> Thanks, guys.
>
> Perhaps I should explain what I am trying to do and then open it up for
> suggestions.
>
> I have 203k training examples each with 457k features. The features are
> composed of one-hot encoded categorical values as well as stemmed, TFIDF
> weighted unigrams and bigrams (NLP). As you can probably guess, the
> overwhelming majority of the features are the unigrams and bigrams.
>
> In the end, I am looking to build a regression model. I have tried a grid
> search on SGDRegressor, but have not had any promising results (~0.00 or
> even negative R^2 values).
>
> I would appreciate ideas/suggestions.

Have you tried plotting a histogram of the target variable? If it is
highly non-Gaussian (e.g. positive with a long tail), predicting the
log or the square root of the target might be easier.
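
A minimal sketch of that check, assuming a positive, long-tailed target.
The lognormal `y` below is synthetic stand-in data, not the real target,
and `skewness` is a small helper written for this example:

```python
import numpy as np


def skewness(y):
    """Third standardized moment; close to 0 for a symmetric distribution."""
    y = np.asarray(y, dtype=float)
    centered = y - y.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5


# Synthetic stand-in for a positive target with a long right tail.
rng = np.random.RandomState(0)
y = rng.lognormal(mean=0.0, sigma=1.0, size=10000)

# Bin counts of the raw target; matplotlib's plt.hist(y, bins=20) draws the same thing.
counts, bin_edges = np.histogram(y, bins=20)

# Two candidate transforms of the target.
y_log = np.log1p(y)   # log(1 + y), also defined when y contains zeros
y_sqrt = np.sqrt(y)

print("skewness raw : %.2f" % skewness(y))
print("skewness log : %.2f" % skewness(y_log))
print("skewness sqrt: %.2f" % skewness(y_sqrt))
```

If the model is then fitted on `y_log`, its predictions have to be mapped
back with `np.expm1` before comparing them to the original target.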

Also have you tried a simpler problem such as binary classification:

1- split your training samples in 3 equal subsets:
  A: 1/3 of the samples with the biggest outputs,
  B: 1/3 of the samples with the smallest outputs,
  C: 1/3 for the remaining samples in the middle.

2- discard C and train a binary classifier (e.g. a grid-searched
SGDClassifier), treating A samples as positive and B samples as
negative.

If you cannot get past 55% cross-validated accuracy on this problem
(chance level is 50% since A and B have the same size), it probably
means that your problem is really hard: either the output variable is
unrelated to the input or the dependency is highly non-linear.
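
The tercile check could be sketched as follows. This assumes a recent
scikit-learn (the `sklearn.model_selection` module); `extreme_terciles`
and `tercile_check` are hypothetical helper names, the `alpha` grid is
only an example, and the sparse toy data stands in for the real matrix:

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score


def extreme_terciles(X, y):
    """Keep the bottom third (label 0) and top third (label 1) by target value."""
    order = np.argsort(np.asarray(y))
    third = len(order) // 3
    keep = np.concatenate([order[:third], order[-third:]])  # subset C is dropped
    labels = np.concatenate([np.zeros(third, dtype=int), np.ones(third, dtype=int)])
    return X[keep], labels


def tercile_check(X, y, random_state=0):
    """Mean cross-validated accuracy of a grid-searched SGDClassifier on A vs B."""
    X_ab, labels = extreme_terciles(X, y)
    search = GridSearchCV(
        SGDClassifier(random_state=random_state),
        param_grid={"alpha": [1e-6, 1e-5, 1e-4, 1e-3]},  # example grid only
        cv=3,
    )
    # The outer loop scores the whole grid search, so the estimate is not
    # biased by the choice of alpha.
    return cross_val_score(search, X_ab, labels, cv=5, scoring="accuracy").mean()


# Toy sparse data with a linear signal (an assumption, for illustration).
rng = np.random.RandomState(0)
X = sparse.random(600, 50, density=0.1, format="csr", random_state=rng)
y = X @ rng.randn(50) + 0.1 * rng.randn(600)

accuracy = tercile_check(X, y)
print("cross-validated accuracy: %.3f" % accuracy)
```

On the real data, an accuracy that stays near 0.5 here would point to the
"really hard problem" case described above.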

You can also try dimensionality reduction by running MiniBatchKMeans
on the whole dataset with 1000 centroids. Then compute the cosine
similarity of your samples with those 1000 centroids, threshold at
zero to keep only non-negative values, and treat those 1000 dimensions
as new features for your samples.

Then train a random forest on the new features.
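
A rough sketch of that pipeline, under assumptions: the toy data and the
10 centroids are placeholders for the real matrix and the 1000 centroids
suggested above, `centroid_similarity_features` is a hypothetical helper
name, and the forest settings are arbitrary:

```python
import numpy as np
from scipy import sparse
from sklearn.cluster import MiniBatchKMeans
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics.pairwise import cosine_similarity


def centroid_similarity_features(X, n_clusters=1000, random_state=0):
    """Map each sample to its thresholded cosine similarity with every centroid."""
    km = MiniBatchKMeans(n_clusters=n_clusters, batch_size=100, n_init=3,
                         random_state=random_state)
    km.fit(X)
    sims = cosine_similarity(X, km.cluster_centers_)  # dense (n_samples, n_clusters)
    return np.maximum(sims, 0.0)                       # threshold at zero


# Toy sparse data standing in for the TF-IDF / one-hot matrix.
rng = np.random.RandomState(0)
X = sparse.random(300, 40, density=0.2, format="csr", random_state=rng)
y = X @ rng.randn(40)

features = centroid_similarity_features(X, n_clusters=10)

# Random forest on the new low-dimensional dense features.
forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(features, y)
predictions = forest.predict(features)
```

With 203k samples and 1000 centroids the similarity matrix is dense, so
on the real data it may be worth computing it in chunks of rows.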

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
