Re: [Scikit-learn-general] Scikit-learn for large datasets?

amir rahimi Fri, 23 Aug 2013 10:15:14 -0700

Hi,

Helge, this ECML/PKDD paper [1] might be helpful in the case of
semi-supervised learning.


Some time ago, one of the authors of [1] and I talked about implementing
the algorithm in sklearn. I think now is a good time to mention it on the
mailing list. I'm not sure whether there is any online semi-supervised
learning method in sklearn right now. I would be grateful if one of the core
developers could tell me whether this would be of interest.


[1] "Manifold Coarse Graining for online semi-supervised learning".
http://link.springer.com/chapter/10.1007/978-3-642-23780-5_35#page-1

Best,
Amir


On Fri, Aug 23, 2013 at 5:33 PM, Olivier Grisel <[email protected]> wrote:

> Thanks for the details. My main advice is still the same: try on small
> subsamples with increasing sizes and check the impact of the size of
> the training set on the test score.
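
A minimal sketch of that subsampling check, assuming a recent scikit-learn
release. The synthetic dataset, the SGDClassifier choice, the sizes and the
helper name subsample_scores are all placeholders for illustration, not
something taken from Helge's setup:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split


def subsample_scores(X, y, sizes, seed=0):
    """Train on nested random subsamples and score each on one fixed test set."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    # One fixed permutation, so each larger subsample contains the smaller ones.
    order = np.random.RandomState(seed).permutation(X_train.shape[0])
    scores = {}
    for n in sizes:
        idx = order[:n]
        clf = SGDClassifier(random_state=seed)
        clf.fit(X_train[idx], y_train[idx])
        scores[n] = clf.score(X_test, y_test)
    return scores


# Toy stand-in for the real data (assumption: 100 dense features, binary target).
X, y = make_classification(
    n_samples=8000, n_features=100, n_informative=20, random_state=0
)
scores = subsample_scores(X, y, sizes=[250, 1000, 4000])
for n in sorted(scores):
    print(n, round(scores[n], 3))
```

If the test score stops improving between the last two sizes, adding more
rows to that model is unlikely to help.
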
>
> For a linear binary classifier I am pretty sure that it's not going to
> help you to use all the data (unless you learn non-linear features
> from the data).
>
> For the 100-dimensional datasets, you should try ExtraTreesClassifier
> on those subsamples rather than linear models.
>
> > All of the above. There are different categorical variables with
> > cardinality anywhere between 2 and 200,000.
> >
> > I would also like to try NMF on the data. Do you think the scikit-learn
> > implementation could work with 100,000 sparse features on 1 billion rows?
>
> I am pretty sure it won't :) The current implementation is a batch method.
>
> Also it very much depends on the number of components you want to
> extract (the dimensionality of the new latent space).
>
> You can try to build new features by:
>
> 1- train a minibatch kmeans model with 1000 clusters
> 2- extract new features by computing the cosine similarity of each
> sample to the 1000 cluster centers
> 3- threshold at 0: zero out negative cosine feature values
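
The three steps above can be sketched roughly as follows, assuming a recent
scikit-learn release. The helper name kmeans_cosine_features, the toy random
data and the small cluster count are placeholders for illustration; the
recipe itself uses 1000 clusters:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics.pairwise import cosine_similarity


def kmeans_cosine_features(X, n_clusters=1000, seed=0):
    """Cluster with minibatch k-means, then build thresholded cosine features."""
    # Step 1: train a minibatch k-means model.
    km = MiniBatchKMeans(n_clusters=n_clusters, random_state=seed, n_init=3)
    km.fit(X)
    # Step 2: cosine similarity of every sample to every cluster center.
    sims = cosine_similarity(X, km.cluster_centers_)
    # Step 3: threshold at 0, i.e. zero out negative similarities.
    return np.maximum(sims, 0.0), km


# Toy stand-in for the real data (assumption: small dense matrix, 10 clusters).
rng = np.random.RandomState(0)
X = rng.randn(500, 20)
features, km = kmeans_cosine_features(X, n_clusters=10)
print(features.shape)
```

For data that does not fit in memory, the same model could instead be
trained chunk by chunk with MiniBatchKMeans.partial_fit, and the features
computed one chunk at a time.
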
>
> Alternatively you can try the new BernoulliRBM model from scikit-learn
> 0.14 as a non-linear feature extractor on the original sparse
> categorical features. However, although the RBM training algorithm is
> online, the implementation in scikit-learn does not have a partial_fit
> method (yet), so you won't be able to use it that way directly with just
> the public API. It might be worth investing time in writing the incremental
> partial_fit method. Most of the work is already implemented in the
> _fit private method though.
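
A minimal sketch of the batch usage that the public fit/transform API does
allow: fit BernoulliRBM on a subsample that fits in memory, then use
transform as the feature extractor. The random sparse binary matrix, the
sizes and the hyperparameters are assumptions for illustration only, not
tuned values:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.neural_network import BernoulliRBM

# Toy stand-in for one-hot encoded categorical features (assumption):
# a sparse binary matrix with roughly 10% non-zero entries.
rng = np.random.RandomState(0)
X = sp.csr_matrix((rng.rand(300, 50) < 0.1).astype(np.float64))

# Batch training on the in-memory subsample.
rbm = BernoulliRBM(
    n_components=16, learning_rate=0.05, batch_size=32, n_iter=5, random_state=0
)
rbm.fit(X)

# Hidden-unit activation probabilities serve as the new non-linear features.
H = rbm.transform(X)
print(H.shape)
```
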
>
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
>
>
> ------------------------------------------------------------------------------
> Introducing Performance Central, a new site from SourceForge and
> AppDynamics. Performance Central is your source for news, insights,
> analysis and resources for efficient Application Performance Management.
> Visit us today!
> http://pubads.g.doubleclick.net/gampad/clk?id=48897511&iu=/4140/ostg.clktrk
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>



-- 
----------------------------------------------------------------------
#include <stdio.h>
double d[]={9299037773.178347,2226415.983937417,307.0};
main(){d[2]--?d[0]*=4,d[1]*=5,main():printf((char*)d);}
----------------------------------------------------------------------

