2013/8/23 [email protected] <[email protected]>:
> Good day,
>
> Can anyone perhaps give me an idea of how large datasets scikit-learn
> algorithms typically can handle?
>
> I have about 4 TB of structured data. I might be able to normalize that down
> to say 1 TB if necessary. The tasks would typically be logistic regression,
> Naive Bayes, k-Means and possible others.
>
> Will scikit-learn algorithms be able to handle this on a fairly powerful
> hardware setup?
>
> At which point does it become necessary to look at distributed ML platforms
> e.g. Mahout instead?
It really depends on what you are doing:

- how many samples?
- how many features?
- how many non-zero features per sample?
- how many target classes? Are they balanced?

What type of data? Just floats? Binary indicators for categorical features?
If so, what is the range of cardinality for those categorical features?

If the number of features is fixed and small-ish (let's say less than one
million), then you can train out-of-core models such as SGDClassifier (with
loss='log' for getting a logistic regression model), MultinomialNB or
BernoulliNB (in 0.14 you now have partial_fit for those too) or
MiniBatchKMeans for unsupervised clustering.

You can have a look at this example for instance:

http://scikit-learn.org/dev/auto_examples/applications/plot_out_of_core_classification.html#example-applications-plot-out-of-core-classification-py

However, it's very likely that simple models like linear models and naive
Bayes won't benefit from more than a few tens of millions of samples if the
number of possible features is small.

So you should first try to learn models on random subsets of your data that
fit in memory and measure the predictive accuracy of the model on held-out
data. Then try again with a slightly bigger dataset (twice as big, but one
that still fits in memory) and measure the accuracy again: if you don't see
an improvement, it means that you don't need that much data.

If you really need to distribute computation on a cluster, you can train
out-of-core linear models on a Spark cluster with PySpark, as Nick explained.

--
Olivier
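As an illustration of the "train on a subset, then double it" experiment described above, here is a minimal sketch. It is not scikit-learn code: TinyLogReg is a hypothetical pure-Python stand-in for an estimator with a partial_fit method (such as SGDClassifier), the data is synthetic, and the subset sizes, batch size and learning rate are arbitrary assumptions chosen only to keep the example small.

```python
import math
import random


class TinyLogReg:
    """Hypothetical stand-in for an estimator exposing partial_fit."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def _score(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

    def partial_fit(self, X, y):
        # One SGD pass over a single mini-batch; the model only ever
        # sees the batch it is handed, as in out-of-core training.
        for x, t in zip(X, y):
            z = max(-30.0, min(30.0, self._score(x)))  # avoid overflow in exp
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - t  # gradient of the log loss w.r.t. the score
            self.w = [wi - self.lr * g * xi for wi, xi in zip(self.w, x)]
            self.b -= self.lr * g
        return self

    def accuracy(self, X, y):
        hits = sum((self._score(x) > 0) == bool(t) for x, t in zip(X, y))
        return hits / len(y)


def make_data(n, rng):
    # Synthetic, linearly separable toy data (an assumption for the demo).
    X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n)]
    y = [1 if x[0] + x[1] > 0 else 0 for x in X]
    return X, y


def batches(X, y, size):
    # Yield consecutive mini-batches; in practice these would be read from disk.
    for i in range(0, len(y), size):
        yield X[i:i + size], y[i:i + size]


def doubling_curve(sizes, batch_size=50, seed=0):
    """Train a fresh model on each subset size and score it on held-out data."""
    rng = random.Random(seed)
    X_train, y_train = make_data(max(sizes), rng)
    X_test, y_test = make_data(500, rng)
    results = []
    for n in sizes:
        model = TinyLogReg(n_features=2)
        for Xb, yb in batches(X_train[:n], y_train[:n], batch_size):
            model.partial_fit(Xb, yb)
        results.append((n, model.accuracy(X_test, y_test)))
    return results


if __name__ == "__main__":
    for n, acc in doubling_curve([100, 200, 400, 800]):
        print(n, round(acc, 3))
```

If the held-out accuracy stops improving from one size to the next, more data is unlikely to help that model. With scikit-learn you would swap TinyLogReg for SGDClassifier and feed partial_fit one chunk at a time; note that its partial_fit needs the full list of classes on the first call.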
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
