Re: [Scikit-learn-general] iPython integration?

Darren Govoni Tue, 08 May 2012 17:57:50 -0700

On Wed, 2012-05-09 at 01:45 +0200, Olivier Grisel wrote:
> 2012/5/8 Darren Govoni <[email protected]>:
> > Still assessing the best models/algorithms to use, but primarily
> > unsupervised learning ones. The models will come from hundreds of millions
> > of data points. We're looking at learned Bayesian networks, predictive
> > analysis, multivariate analysis and clustering approaches over
> > distributed data.
> 
> How many non-zero features per sample? How many features in total
> (number of input dimensions)? Do you have labels for each sample? If
> so, are they categorical (classification) and how many classes? or are
> they continuous (regression) and if so how many output variables?
I will have better answers for this over time (I'm not an ML expert
myself, yet). One model for learned Bayesian networks could have hundreds
of thousands of variables (columns), one per text token. Probably between
250k and 300k at most. The count will vary, but it will grow over time. The
text tokens represent words, and there is a finite number of them. ;)


We're still experimenting with discrete versus continuous values (integers
only, no floats though). Each sample will have a rather sparse set of
non-zero values, I think. I've gotten good results with simple 0/1 values
but am looking at values on a 0-99 scale too.
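
A minimal sketch of the two encodings, assuming the values are per-document
token counts (an assumption; nothing above says how the 0-99 values are
derived). The function name `sparse_vector`, its parameters, and the sample
sentence are hypothetical and only illustrate storing the non-zero entries:

```python
from collections import Counter


def sparse_vector(tokens, binary=True, cap=99):
    """Return a sparse {token: value} mapping holding only non-zero entries.

    binary=True  -> every token that occurs gets the value 1 (0/1 encoding)
    binary=False -> integer counts, clipped to `cap` (assumed 0-99 scale)
    """
    counts = Counter(tokens)
    if binary:
        return {tok: 1 for tok in counts}
    return {tok: min(n, cap) for tok, n in counts.items()}


# Hypothetical sample document, split on whitespace.
doc = "the cat sat on the mat".split()
binary_vec = sparse_vector(doc)               # presence/absence only
count_vec = sparse_vector(doc, binary=False)  # integer counts, no floats
```

Only tokens that actually occur are stored, so memory grows with the number
of non-zero values per sample rather than with the 250k-300k columns.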

Without disclosing all the details, the goal of this particular model
would be to discover causal links between variables. We experimented a
bit with PEBL, but it offers only one algorithm compared to the breadth of
scikit-learn. That's why I'm here now, learning about it.

> 
> How much data (in GB) does it represent once vectorized as binary or
> numerical feature values?
I'll say 100GB+ is where our requirements are.

> 
> If you want to do supervised learning (regression or classification) I
> would recommend you to do some commandline tests with vowpal wabbit:
> it can handle linear models at a terafeature scale very efficiently.
Will try it. Thanks for the tip.

> Also it does feature extraction from a svmlight-style input format
> that has been extended to handle feature names (e.g. text tokens) and
> feature namespaces and does the vectorization on the go very
> efficiently memory-wise by using feature hashing.
Very interesting. Our space is in text analysis anyway!
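
For readers unfamiliar with feature hashing, here is a rough sketch of the
idea: each token is hashed straight to a column index, so no token-to-column
dictionary has to be kept in memory. This is not Vowpal Wabbit's code; the
function name `hashed_features`, the bucket count, and the use of MD5 as the
hash are assumptions chosen to keep the example dependency-free.

```python
import hashlib


def hashed_features(tokens, n_buckets=2 ** 18):
    """Map tokens to a sparse {column_index: count} dict via hashing.

    No vocabulary is stored: the column of a token is derived from its
    hash, so memory use does not depend on how many distinct tokens exist.
    Distinct tokens may collide in the same bucket; that is the trade-off.
    """
    vec = {}
    for tok in tokens:
        # MD5 is a stand-in for a fast non-cryptographic hash (assumption).
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "little") % n_buckets
        vec[index] = vec.get(index, 0) + 1
    return vec


# Hypothetical sample: three tokens, two of them identical.
vec = hashed_features("spam ham spam".split(), n_buckets=1024)
```

The bucket count fixes the dimensionality up front, which is what lets the
vectorization run in a single streaming pass over the text.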

