Thanks for the details. My main advice is still the same: try on small subsamples with increasing sizes and check the impact of the size of the training set on the test score.
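A minimal sketch of that check, assuming a recent scikit-learn (sklearn.model_selection did not exist in 0.14). The synthetic data, the SGDClassifier choice, the subsample sizes, and the helper name subsample_scores are all illustrative assumptions, not something from your setup:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split


def subsample_scores(X, y, sizes, seed=0):
    """Return (n_train, test_accuracy) pairs for nested training subsamples.

    Hypothetical helper: the same held-out test set is used for every size,
    so the scores are comparable across subsample sizes.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    # Shuffle once so each larger subsample contains the smaller ones.
    order = np.random.RandomState(seed).permutation(X_train.shape[0])
    results = []
    for n in sizes:
        idx = order[:n]
        clf = SGDClassifier(random_state=seed)  # any linear model would do
        clf.fit(X_train[idx], y_train[idx])
        results.append((n, clf.score(X_test, y_test)))
    return results


# Synthetic stand-in for the real data (assumption for illustration only).
X, y = make_classification(n_samples=5000, n_features=100,
                           n_informative=10, random_state=0)
scores = subsample_scores(X, y, sizes=[100, 300, 1000, 3000])
for n, acc in scores:
    print(n, round(acc, 3))
```

If the test score stops improving well before the largest size, more rows are unlikely to help that model.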
For a linear binary classifier, I am pretty sure that using all the data is not going to help (unless you learn non-linear features from the data). For the 100-dimensional datasets, you should try ExtraTreesClassifier on those subsamples rather than linear models.

> All of the above. There are different categorical variables with cardinality
> anywhere between 2 and 200,000.
>
> I would also like to try NMF on the data. Do you think the scikit-learn
> implementation could work with 100,000 sparse features on 1 billion rows?

I am pretty sure it won't :) The current implementation is a batch method. Also, it very much depends on the number of components you want to extract (the dimensionality of the new latent space).

You can try to build new features by:

1- training a minibatch k-means model with 1000 clusters
2- extracting new features by computing the cosine similarity of each sample to the 1000 cluster centers
3- thresholding at 0: zero out negative cosine feature values

Alternatively, you can try the new BernoulliRBM model from scikit-learn 0.14 as a non-linear feature extractor from the original sparse categorical features. However, although the RBM training algorithm is online, the implementation in scikit-learn does not have a partial_fit method (yet), so you won't be able to use it directly with just the public API. It might be worth investing time in writing the incremental partial_fit method. Most of the work is already implemented in the _fit private method, though.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
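The three-step k-means feature recipe above could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name kmeans_cosine_features, the random binary sparse matrix standing in for one-hot encoded categorical data, and the small cluster count in the demo are all made up, and a recent scikit-learn and SciPy are assumed.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.cluster import MiniBatchKMeans
from sklearn.preprocessing import normalize


def kmeans_cosine_features(X, n_clusters=1000, seed=0):
    """Hypothetical helper: cosine similarity to k-means centers, clipped at 0."""
    # Step 1: fit a minibatch k-means model (accepts sparse CSR input).
    km = MiniBatchKMeans(n_clusters=n_clusters, random_state=seed, n_init=3)
    km.fit(X)

    # Step 2: cosine similarity = dot product of L2-normalized rows.
    centers = normalize(km.cluster_centers_)  # dense, one unit-norm row per center
    X_unit = normalize(X)                     # stays sparse
    sims = np.asarray(X_unit @ centers.T)     # dense (n_samples, n_clusters)

    # Step 3: zero out negative similarities. For non-negative inputs such as
    # one-hot features this should be a no-op; it matters for centered data.
    return np.maximum(sims, 0.0)


# Illustrative data only: 500 rows, 200 binary sparse features.
X = sp.random(500, 200, density=0.05, format="csr", random_state=0)
X.data[:] = 1.0
features = kmeans_cosine_features(X, n_clusters=10)
print(features.shape)
```

The output is dense, so with 1000 clusters it would need to be computed and consumed chunk by chunk on a dataset of the size discussed here.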
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
