On Thu, Oct 01, 2015 at 11:10:51AM +0200, Maryam Tavakol wrote:
> My problem however is the size of data in terms of number of samples.
> The features are engineered and are only 80. I wanted to try training
> on bigger set of data for improvement.

I would use the BIRCH clustering method in an online way (using
partial_fit) to create a coreset: a reduced set of data points that
best represents the original samples, each with an associated weight
(the number of original data points in its cluster). I would then
train the gradient boosted classifier on the reduced data points,
passing those weights as sample weights.

Gaël

------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
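
The coreset idea above could be sketched roughly as follows. This is only an illustration under several assumptions that are not in the original message: one Birch model is fitted per class so that every coreset point inherits a class label, the weights are obtained in a second pass over the chunks by counting the samples nearest to each subcluster centre, and the helper names (build_coreset, make_chunks), the threshold value and the synthetic data are all hypothetical.

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.ensemble import GradientBoostingClassifier


def build_coreset(make_chunks, classes, threshold):
    """Return (centers, labels, weights) summarising the chunked data.

    make_chunks is a hypothetical callable that returns a fresh iterator
    over (X, y) chunks each time it is called. Every class in `classes`
    is assumed to appear in the data at least once.
    """
    # Assumption: one Birch model per class, so centres carry a label.
    # n_clusters=None keeps the raw subclusters (no global clustering step).
    models = {c: Birch(threshold=threshold, n_clusters=None) for c in classes}

    # Pass 1: grow each CF tree incrementally, one chunk at a time.
    for X, y in make_chunks():
        for c in classes:
            mask = y == c
            if mask.any():
                models[c].partial_fit(X[mask])

    # Pass 2: count how many original samples fall nearest to each centre.
    counts = {c: np.zeros(len(models[c].subcluster_centers_)) for c in classes}
    for X, y in make_chunks():
        for c in classes:
            mask = y == c
            if mask.any():
                nearest = models[c].predict(X[mask])
                counts[c] += np.bincount(nearest, minlength=len(counts[c]))

    centers, labels, weights = [], [], []
    for c in classes:
        keep = counts[c] > 0  # drop centres that attracted no samples
        centers.append(models[c].subcluster_centers_[keep])
        labels.append(np.full(keep.sum(), c))
        weights.append(counts[c][keep])
    return np.vstack(centers), np.concatenate(labels), np.concatenate(weights)


# Synthetic stand-in for the real data: two well-separated classes.
rng = np.random.default_rng(0)
n_per_class, n_features = 1000, 5
X_all = np.vstack([
    rng.normal(0.0, 1.0, (n_per_class, n_features)),
    rng.normal(5.0, 1.0, (n_per_class, n_features)),
])
y_all = np.repeat([0, 1], n_per_class)
order = rng.permutation(len(y_all))
X_all, y_all = X_all[order], y_all[order]


def make_chunks(chunk_size=250):
    # In practice each chunk would be read from disk instead of sliced
    # from an in-memory array.
    for start in range(0, len(y_all), chunk_size):
        yield X_all[start:start + chunk_size], y_all[start:start + chunk_size]


X_core, y_core, w_core = build_coreset(make_chunks, classes=[0, 1], threshold=1.5)

# Train on the reduced points, weighting each by the samples it represents.
clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_core, y_core, sample_weight=w_core)
```

The threshold controls how aggressively the data is reduced: larger values give fewer, coarser coreset points, so it would need tuning on the real 80-feature data (and the features probably need scaling first, since Birch relies on Euclidean distances).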