2012/5/9 Darren Govoni <[email protected]>:
> On Wed, 2012-05-09 at 01:45 +0200, Olivier Grisel wrote:
>> 2012/5/8 Darren Govoni <[email protected]>:
>> > Still assessing the best models/algorithms to use, but primarily
>> > unsupervised learning ones. The models will come from 100's of millions
>> > of data points. We're looking at learned bayesian networks, predictive
>> > analysis, multivariate analysis and clustering approaches over
>> > distributed data.
>>
>> How many non-zero features per sample? How many features in total
>> (number of input dimensions)? Do you have labels for each sample? If
>> so, are they categorical (classification) and how many classes? Or are
>> they continuous (regression) and, if so, how many output variables?
>
> I will have better answers for this over time (I'm not an ML expert
> myself, yet). One model for learned bayesian networks could have 100's
> of thousands of variables (columns) as text tokens: probably between
> 250k-300k max. On average it will vary, but grow over time. The text
> tokens would represent words and there is a finite amount of them. ;)
>
> We're still experimenting between continuous and discrete values (no
> floats though). Each sample will have a rather sparse set of non-zero
> values, I think. I've gotten good results with simple 0/1 values but am
> looking at 0-99 continuous values too.
>
> Without disclosing all the details, the goal of this particular model
> would be to discover causal links between variables. We experimented
> some with PEBL, but it is only one algorithm compared to the power of
> scikit. So that's why I'm now here learning about it.
>
>> How much data (in GB) does it represent once vectorized as binary or
>> numerical feature values?
>
> I'll say 100GB+ is where our requirements are.
For text classification at this scale I would indeed first try to play with efficient incremental / streaming algorithms like vowpal wabbit on a single node, maybe on a small subset of the data, to get a feel for what you can expect from the data and how things evolve (accuracy measures, learning time, memory usage, model size) as you slowly grow the training set size.

In scikit-learn there are some models that should scale OK-ish to a large number of samples, such as the SGD-based classification and regression models and MiniBatchKMeans (with random init). In both cases you will need to use the `partial_fit` method to incrementally update the model from chunks of data read from disk or from your database, for instance 1000 documents at a time, so as to keep memory consumption under control (see the sketches in the P.S. below). Keep in mind that those models are in no way as battle-tested on large data as vowpal wabbit is.

Another limitation of scikit-learn for analyzing large text corpora is the current lack of a text vectorizer that uses a hash function instead of an in-memory, dictionary-based mapping from the string representation of the tokens to the integer indices used to build the feature vectors (the last sketch below illustrates the idea). You should read http://scikit-learn.org/dev/modules/feature_extraction.html#text-feature-extraction and the code of the sklearn.feature_extraction.text module to understand what I am talking about.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
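P.S. Here are a few rough, untested sketches to make the above more concrete. First, the kind of out-of-core `partial_fit` loop I have in mind with `SGDClassifier`. The random sparse chunks are just stand-ins (an assumption for the sake of the example) for your own vectorized documents, 1000 per chunk, that you would read from disk or your database:

    import numpy as np
    import scipy.sparse as sp
    from sklearn.linear_model import SGDClassifier

    n_features = 2 ** 18            # fixed feature space, e.g. from a hashing vectorizer
    all_classes = np.array([0, 1])  # partial_fit needs the full list of classes up front
    rng = np.random.RandomState(42)

    clf = SGDClassifier(alpha=1e-6)

    for chunk_idx in range(10):
        # stand-in for 1000 vectorized documents streamed from disk / a database
        X_chunk = sp.random(1000, n_features, density=1e-4,
                            format="csr", random_state=rng)
        y_chunk = rng.randint(0, 2, size=1000)
        # each call updates the linear model in place, so memory use stays bounded
        clf.partial_fit(X_chunk, y_chunk, classes=all_classes)

The important point is that only one chunk is in memory at a time; the model itself is just a weight vector of size n_features.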
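MiniBatchKMeans can be driven the same way for the unsupervised / clustering side (again a sketch, with random sparse data standing in for your real chunks):

    import numpy as np
    import scipy.sparse as sp
    from sklearn.cluster import MiniBatchKMeans

    rng = np.random.RandomState(0)
    km = MiniBatchKMeans(n_clusters=20, init="random", random_state=0)

    for chunk_idx in range(10):
        # stand-in for a chunk of 1000 vectorized documents
        X_chunk = sp.random(1000, 2 ** 18, density=1e-4,
                            format="csr", random_state=rng)
        km.partial_fit(X_chunk)    # updates the cluster centers incrementally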
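Finally, a toy illustration (my own sketch, not existing scikit-learn code) of the hashing trick I mentioned: tokens are mapped to column indices with a hash function, so there is no vocabulary dictionary to keep in memory and the vectorizer state does not grow with the corpus. A real implementation would use a fast, stable hash (e.g. a murmurhash) rather than Python's built-in `hash`:

    import scipy.sparse as sp

    N_FEATURES = 2 ** 20   # fixed output dimensionality; hash collisions are tolerated


    def hashing_vectorize(documents, n_features=N_FEATURES):
        """Turn an iterable of token lists into a sparse term-count matrix."""
        rows, cols, data = [], [], []
        n_docs = 0
        for i, tokens in enumerate(documents):
            n_docs = i + 1
            counts = {}
            for token in tokens:
                j = hash(token) % n_features   # no in-memory vocabulary needed
                counts[j] = counts.get(j, 0) + 1
            for j, count in counts.items():
                rows.append(i)
                cols.append(j)
                data.append(count)
        return sp.csr_matrix((data, (rows, cols)), shape=(n_docs, n_features))


    X = hashing_vectorize([["the", "cat", "sat"],
                           ["the", "dog", "sat", "sat"]])
    print(X.shape, X.nnz)

This is essentially the trick vowpal wabbit relies on for its input features, which is why it never needs to hold a vocabulary in memory.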
