2013/8/23 [email protected] <[email protected]>:
> Good day,
>
> Can anyone perhaps give me an idea of how large datasets scikit-learn
> algorithms typically can handle?
>
> I have about 4 TB of structured data. I might be able to normalize that down
> to say 1 TB if necessary. The tasks would typically be logistic regression,
> Naive Bayes, k-Means and possible others.
>
> Will scikit-learn algorithms be able to handle this on a fairly powerful
> hardware setup?
>
> At which point does it become necessary to look at distributed ML platforms
> e.g. Mahout instead?

It really depends on what you are doing:

- how many samples?
- how many features?
- how many non-zero features per sample?
- how many target classes? Are they balanced?

What type of data is it? Just floats? Binary indicators for categorical
features? If so, what is the range of cardinalities of those
categorical features?

If the number of features is fixed and small-ish (say, less than
one million), then you can train out-of-core models such as
SGDClassifier (with loss='log' to get a logistic regression
model), MultinomialNB or BernoulliNB (as of 0.14 those have
partial_fit too), or MiniBatchKMeans for unsupervised clustering.

You can have a look at this example for instance:
http://scikit-learn.org/dev/auto_examples/applications/plot_out_of_core_classification.html#example-applications-plot-out-of-core-classification-py

However, it's very likely that simple models such as linear models and
naive Bayes won't benefit from more than a couple of tens of millions
of samples if the number of possible features is small.

So you should first train models on random subsets of your data
that fit in memory and measure the predictive accuracy of each
model. Then try again with a slightly bigger dataset (twice as
big, but still fitting in memory) and measure the performance
again: if you don't see an improvement, you probably don't need
that much data.
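That doubling experiment can be sketched as follows (using current scikit-learn module paths; the synthetic dataset is just a placeholder for subsamples of your real data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# stand-in for your real data
X, y = make_classification(n_samples=20000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.RandomState(0)
scores = []
n = 500
while n <= len(X_train):
    # draw a random subset of size n and train on it
    idx = rng.choice(len(X_train), size=n, replace=False)
    clf = LogisticRegression().fit(X_train[idx], y_train[idx])
    scores.append(clf.score(X_test, y_test))
    print(n, scores[-1])
    n *= 2  # double the subset size each round
```

If the last few scores have plateaued, a model trained on an in-memory subset is probably as good as one trained on the full dataset.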

If you really need to distribute the computation over a cluster, you
can train out-of-core linear models on a Spark cluster with PySpark,
as Nick explained.

-- 
Olivier

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general