Hi Youssef,
Please make sure that you are using the latest version of sklearn (>= 0.13) -
we made some enhancements to the sub-sampling procedure recently.
Looking at the RandomForest code, n_jobs=-1 should not be the issue for the
parallel training of the trees, since ``n_jobs = min(cpu_count(),
self.n_estimators)``, which should be just 3 in your case. However, it will
use cpu_count() processes to sort the feature values, so the bottleneck
might be there. Please try setting the n_jobs parameter to a smaller
constant (e.g. 4) and check whether that works better.
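Something like this (same parameters as in your snippet, just with a bounded
n_jobs):

from sklearn.ensemble import RandomForestClassifier

# Cap the number of worker processes instead of letting the feature
# sorting fan out to all 24 cores.
clf = RandomForestClassifier(n_estimators=3, max_depth=20,
                             criterion='entropy', n_jobs=4, verbose=10)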
Having said that: 1E8 samples is pretty large - the largest dataset I've
used so far was merely 1E6, but I've heard of people using the
implementation on larger datasets too (probably not 1E8, though).
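Regarding your ETA question: before committing to 1E8 samples, you could
time fits on growing subsamples and check whether the timings track the
documented O(n_samples * n_features * log(n_samples)) complexity. A quick
sketch (the helper name and sizes are just mine; replace the synthetic data
with your depth features):

import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real data.
X, y = make_classification(n_samples=30000, n_features=500,
                           n_informative=10, random_state=0)

def time_fit(n):
    # Fit on the first n samples and return the wall-clock time.
    clf = RandomForestClassifier(n_estimators=3, max_depth=20,
                                 criterion='entropy', n_jobs=4)
    tic = time.time()
    clf.fit(X[:n], y[:n])
    return time.time() - tic

# If t / (n * log(n)) stays roughly constant, extrapolating the last
# timing to 1E8 samples gives a ballpark ETA.
for n in (3000, 10000, 30000):
    t = time_fit(n)
    print('%d samples: %.1fs, t/(n log n) = %.3g'
          % (n, t, t / (n * np.log(n))))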
Running the code on a cluster using IPython parallel should not be too hard
- RF is a pretty simple algorithm - you could either patch the existing
code to use IPython parallel instead of joblib.Parallel (see forest.py) or
simply write your own RF code that directly uses
``DecisionTreeClassifier``. Also, you can likely skip bootstrapping - it
doesn't help much IMHO and can make the implementation a bit more
"involved" - AFAIK the MSR guys didn't use bootstrapping for their Kinect
RF system...
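A hand-rolled version could be as simple as the sketch below (the function
names are mine, not an sklearn API) - and since each tree is fit
independently, the same map() can be handed to IPython.parallel's
load-balanced view:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_one_tree(args):
    # Fit a single tree on the *whole* data (no bootstrapping);
    # diversity comes from the random feature subsets considered at
    # each split (max_features).
    X, y, seed = args
    tree = DecisionTreeClassifier(max_depth=20, criterion='entropy',
                                  max_features='sqrt', random_state=seed)
    return tree.fit(X, y)

def predict_forest(trees, X):
    # Average the per-tree class probabilities, take the argmax, and
    # map back through classes_ to get label values.
    proba = np.mean([t.predict_proba(X) for t in trees], axis=0)
    return trees[0].classes_[np.argmax(proba, axis=1)]

# Serial:
#   trees = map(fit_one_tree, [(X, y, s) for s in range(3)])
# Distributed (requires a running ipcluster):
#   from IPython.parallel import Client
#   view = Client().load_balanced_view()
#   trees = view.map_sync(fit_one_tree, [(X, y, s) for s in range(3)])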
When it comes to other implementations, you could look at rt-rank [1],
which is a parallel implementation of both GBRT and RF, and WiseRF [2],
which is compatible with sklearn but requires a license (there are free
trial and academic versions AFAIK).
HTH,
Peter
[1] https://sites.google.com/site/rtranking/
[2] http://about.wise.io/
On 25.04.2013 03:22, "Youssef Barhomi" <youssef.barh...@gmail.com> wrote:
> Hello,
>
> I am trying to reproduce the results of this paper:
> http://research.microsoft.com/pubs/145347/BodyPartRecognition.pdf with
> different kinds of data (monkey depth maps instead of human ones). So I am
> generating my depth features and training and classifying the data with a
> random forest, using parameters quite similar to the paper's.
>
> I would like to use sklearn.ensemble.RandomForestClassifier with 1E8
> samples of 500 features each. Since that is a large dataset of feature
> vectors, I ran some trials with smaller subsets (1E4, 1E5, 1E6 samples),
> and the last one seemed slower than the
> O(n_samples * n_features * log(n_samples)) complexity given here:
> http://scikit-learn.org/stable/modules/tree.html#complexity. Since the 1E6
> samples are taking a long time and I don't know when they will finish, I
> would like a better way to estimate the ETA or a way to speed up the
> training. Also, I am watching my memory usage and I don't seem to be
> swapping (29GB/48GB in use right now). The other thing is that I requested
> n_jobs = -1 so it could use all cores of my machine (24 cores), but
> looking at my CPU usage, it doesn't seem to be using any of them...
>
> So, do you guys have any ideas on:
> - would 1E8 samples be doable with your implementation of random forests
> (3 trees, 20 levels deep)?
> - running this code on a cluster using different IPython engines? Or would
> that require a lot of work?
> - PCA for dimensionality reduction? (in the paper they didn't use any dim
> reduction, so I am trying to avoid that)
> - other implementations that I could use for large datasets?
>
> PS: I am very new to this library, but I am already impressed!! It's one
> of the cleanest and probably most intuitive machine learning libraries out
> there, with pretty impressive documentation and tutorials. Pretty amazing
> work!!
>
> Thank you very much,
> Youssef
>
>
> ####################################
> #######Here is a code snippet:
> ####################################
>
> from sklearn.datasets import make_classification
> from sklearn.ensemble import RandomForestClassifier
> from sklearn.cross_validation import train_test_split
> from sklearn.preprocessing import StandardScaler
> import time
> import numpy as np
>
> # Generate a synthetic classification problem.
> n_samples = 1000
> n_features = 500
> X, y = make_classification(n_samples, n_features, n_redundant=0,
>                            n_informative=2, random_state=1,
>                            n_clusters_per_class=1)
> clf = RandomForestClassifier(max_depth=20, n_estimators=3,
>                              criterion='entropy', n_jobs=-1, verbose=10)
>
> # Add some noise to the features and standardize them.
> rng = np.random.RandomState(2)
> X += 2 * rng.uniform(size=X.shape)
> X = StandardScaler().fit_transform(X)
> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
>
> # Time the fit and report the held-out accuracy.
> tic = time.time()
> clf.fit(X_train, y_train)
> score = clf.score(X_test, y_test)
> print 'Time taken:', time.time() - tic, 'seconds'
> print 'Test accuracy:', score
>
>
> --
> Youssef Barhomi, MSc, MEng.
> Research Software Engineer at the CLPS department
> Brown University
> T: +1 (617) 797 9929 | GMT -5:00