2012/5/8 Darren Govoni <[email protected]>:
> Still assessing the best models/algorithms to use, but primarily
> unsupervised learning ones. The models will come from 100's of millions
> of data points. We're looking at learned bayesian networks, predictive
> analysis, multivariate analysis and clustering approaches over
> distributed data.

How many non-zero features per sample? How many features in total
(number of input dimensions)? Do you have labels for each sample? If
so, are they categorical (classification), and how many classes? Or are
they continuous (regression), and if so, how many output variables?

How much data (in GB) does it represent once vectorized as binary or
numerical feature values?
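
For a rough answer to the size question, a back-of-the-envelope
estimate along these lines may help. It is only a sketch: it assumes
the data is stored as a sparse CSR matrix, and the sample count,
non-zeros per sample and byte widths below are hypothetical
placeholders, not figures from this thread.

```python
def estimate_csr_gb(n_samples, nnz_per_sample,
                    value_bytes=8, index_bytes=4, indptr_bytes=8):
    """Approximate in-memory size (decimal GB) of a CSR sparse matrix.

    All arguments are assumptions to adapt to the real data:
    value_bytes=8 is float64, index_bytes=4 is int32 column indices,
    indptr_bytes=8 is int64 row pointers (needed once the total
    number of non-zeros exceeds 2**31).
    """
    nnz = n_samples * nnz_per_sample
    data = nnz * value_bytes               # stored feature values
    indices = nnz * index_bytes            # column index of each value
    indptr = (n_samples + 1) * indptr_bytes  # row start offsets
    return (data + indices + indptr) / 1e9


# Hypothetical example: 100 million samples, 100 non-zeros each.
size_gb = estimate_csr_gb(100_000_000, 100)
print(f"{size_gb:.1f} GB")
```

Under these assumptions the total comes out at roughly 120 GB, which
is mostly a function of the number of non-zero entries rather than of
the total number of input dimensions.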

If you want to do supervised learning (regression or classification), I
would recommend running some command-line tests with Vowpal Wabbit:
it can handle linear models at a terafeature scale very efficiently.
It also does feature extraction from an svmlight-style input format
that has been extended to handle feature names (e.g. text tokens) and
feature namespaces, and it vectorizes on the fly in a memory-efficient
way by using feature hashing.
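
To make the two ideas above concrete, here is a minimal sketch in
Python. The helper names (hash_features, to_vw_line) and the n_bits
parameter are hypothetical, and zlib.crc32 is used only as a
convenient deterministic stand-in: it is not the hash function Vowpal
Wabbit itself uses. The first function illustrates the hashing trick
in general; the second writes one example in the named-feature,
namespaced text format ("label |namespace feature:value ...").

```python
import zlib
from collections import defaultdict


def hash_features(tokens, n_bits=18, namespace=""):
    """Hashing trick: map string features to a sparse {index: count}
    dict with 2**n_bits possible indices, without building a vocabulary.
    crc32 is an illustrative stand-in hash (an assumption of this sketch).
    """
    mask = (1 << n_bits) - 1
    vec = defaultdict(float)
    for tok in tokens:
        key = (namespace + "^" + tok).encode("utf-8")
        vec[zlib.crc32(key) & mask] += 1.0  # collisions simply add up
    return dict(vec)


def to_vw_line(label, namespaces):
    """Format one example as 'label |ns feat[:value] ...'.
    Feature names must not contain spaces, ':' or '|'.
    A value of 1 is left implicit, as the format allows.
    """
    parts = [str(label)]
    for ns, feats in namespaces.items():
        body = " ".join(
            name if value == 1 else f"{name}:{value}"
            for name, value in feats.items()
        )
        parts.append(f"|{ns} {body}")
    return " ".join(parts)


vec = hash_features(["big", "data", "big"], n_bits=10, namespace="title")
line = to_vw_line(1, {"title": {"big": 1, "data": 1},
                      "meta": {"length": 2.5}})
print(line)  # 1 |title big data |meta length:2.5
```

Because the index is computed directly from the feature name, memory
use is bounded by 2**n_bits no matter how many distinct tokens show up
in the stream, at the cost of occasional collisions.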

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
