Thanks for the details. My main advice is still the same: train on small
subsamples of increasing size and check how the size of the training
set affects the test score.
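
As a rough sketch of what I mean (not a definitive recipe): the snippet
below fits a linear model on nested subsamples of growing size and prints
the test score for each. The synthetic data from make_classification, the
subsample sizes and the helper name subsample_scores are all assumptions
made for illustration; plug in a subsample of your own data instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): replace with a subsample of the real data.
X, y = make_classification(n_samples=20000, n_features=50,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)


def subsample_scores(X_train, y_train, X_test, y_test, sizes, seed=0):
    """Fit a fresh linear model on nested subsamples; return (size, test score) pairs."""
    rng = np.random.RandomState(seed)
    # One fixed shuffle so that each larger subsample contains the smaller ones.
    order = rng.permutation(X_train.shape[0])
    scores = []
    for n in sizes:
        idx = order[:n]
        clf = SGDClassifier(random_state=seed)
        clf.fit(X_train[idx], y_train[idx])
        scores.append((n, clf.score(X_test, y_test)))
    return scores


sizes = [500, 2000, 8000, 15000]  # hypothetical sizes, all <= len(X_train)
scores = subsample_scores(X_train, y_train, X_test, y_test, sizes)
for n, score in scores:
    print(n, round(score, 3))
```

If the test score stops improving between the last sizes, adding more
rows is unlikely to pay off for that model.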

For a linear binary classifier I am pretty sure that it's not going to
help you to use all the data (unless you learn non-linear features
from the data).

For the 100-dimensional datasets, you should try ExtraTreesClassifier
on those subsamples rather than linear models.
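
A minimal sketch of that, assuming a 100-feature subsample that fits in
memory. The synthetic data and the choice of 50 trees are placeholders,
not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

# Synthetic 100-dimensional stand-in for one subsample (assumption).
X, y = make_classification(n_samples=5000, n_features=100,
                           n_informative=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# n_estimators=50 is an arbitrary illustrative value.
forest = ExtraTreesClassifier(n_estimators=50, random_state=0)
forest.fit(X_train, y_train)
forest_score = forest.score(X_test, y_test)
print("ExtraTrees test accuracy:", round(forest_score, 3))
```

The same growing-subsample check applies here: refit on larger subsamples
and watch whether the test accuracy keeps moving.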

> All of the above. There are different categorical variables with cardinality
> anywhere between 2 and 200,000.
>
> I would also like to try NMF on the data. Do you think the scikit-learn
> implementation could work with 100,000 sparse features on 1 billion rows?

I am pretty sure it won't :) The current implementation is a batch method,
so it needs the whole dataset at once.

Also it very much depends on the number of components you want to
extract (the dimensionality of the new latent space).

You can try to build new features by:

1- train a minibatch kmeans model with 1000 clusters
2- extract new features by computing the cosine similarity of each
sample to the 1000 cluster centers
3- threshold at 0: zero out negative cosine feature values
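
The three steps above could be sketched roughly as follows. This is only
an illustration under stated assumptions: the data is a small random
sparse matrix with signed values, 20 clusters are used instead of 1000 to
keep it fast, and the chunk size of 500 is arbitrary. The model is fed
chunk by chunk with partial_fit, which is how you would stream data that
does not fit in memory.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics.pairwise import cosine_similarity

# Small random sparse stand-in data with signed values (assumption).
rng = np.random.RandomState(0)
dense = rng.randn(2000, 300) * (rng.rand(2000, 300) < 0.05)
X = sp.csr_matrix(dense)

n_clusters = 20   # the text suggests 1000; reduced here for speed
chunk_size = 500  # arbitrary; each chunk must hold at least n_clusters rows

# Step 1: train minibatch k-means incrementally, one chunk at a time.
km = MiniBatchKMeans(n_clusters=n_clusters, n_init=3, random_state=0)
for start in range(0, X.shape[0], chunk_size):
    km.partial_fit(X[start:start + chunk_size])

# Step 2: cosine similarity of each sample to each cluster center.
features = cosine_similarity(X, km.cluster_centers_)

# Step 3: threshold at 0, zeroing out negative similarities in place.
np.maximum(features, 0, out=features)

print(features.shape)
```

Note that with purely non-negative inputs (for instance raw one-hot
encoded categories) the similarities are already non-negative, so step 3
only changes something once the data contains negative values.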

Alternatively you can try the new BernoulliRBM model from scikit-learn
0.14 as a non-linear feature extractor on the original sparse
categorical features. However, although the RBM training algorithm is
online, the implementation in scikit-learn does not have a partial_fit
method (yet), so you won't be able to use it incrementally with just
the public API. It might be worth investing time in writing the
incremental partial_fit method: most of the work is already implemented
in the private _fit method.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general