Thank you very much, Peter.

You were right about n_jobs: something was going wrong there. With
n_jobs = -1 on the larger dataset (1E6 samples in this case), no CPU was being
used and the process hung for a while; setting n_jobs = 1 made everything
work.
Yes, I will look into IPython parallel and see if I can do that.
I have just tried wiseRF and it worked like a charm, with almost the same
accuracy as the scikit-learn RF and about a 6x speedup so far. I was able to
run a 1E6 x 500 dataset in 45 seconds with 14 GB of RAM in use. I will try
rtranking sometime today. Now I am clearly memory bound; would you
recommend an online RF library at this point?
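
For the IPython parallel idea, here is the kind of thing I have in mind, as a
minimal and completely untested sketch. The fit_subforest helper, the
push/map pattern, and the trick of merging forests by concatenating their
estimators_ lists are my own guesses about how this could be wired up, not
anything official from scikit-learn or IPython:

# Completely untested sketch: one single-tree forest per task, merged
# afterwards. Assumes an ipcluster is already running and that the
# training data fits in memory on every engine (pushing 1E8 rows like
# this would not fly; each engine would have to load its own shard).
from IPython.parallel import Client

def fit_subforest(seed):
    # runs on an engine; X_train and y_train are pushed below
    from sklearn.ensemble import RandomForestClassifier
    sub = RandomForestClassifier(n_estimators=1, max_depth=20,
                                 criterion='entropy', random_state=seed)
    sub.fit(X_train, y_train)
    return sub

rc = Client()
dview = rc[:]
dview.push(dict(X_train=X_train, y_train=y_train))
subforests = dview.map_sync(fit_subforest, range(3))  # 3 trees total

# merge the single-tree forests into one classifier; this relies on all
# of them having seen the same set of classes
forest = subforests[0]
for sub in subforests[1:]:
    forest.estimators_ += sub.estimators_
forest.n_estimators = len(forest.estimators_)

Pushing X_train to every engine obviously would not scale to 1E8 rows; in
practice each engine would have to load its own shard from disk instead.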


 On 25.04.2013 03:22, "Youssef Barhomi" <youssef.barh...@gmail.com> wrote:
>
>> Hello,
>>
>> I am trying to reproduce the results of this paper:
>> http://research.microsoft.com/pubs/145347/BodyPartRecognition.pdf with a
>> different kind of data (monkey depth maps instead of humans). So I am
>> generating my depth features and training and classifying the data with a
>> random forest, using parameters quite similar to the paper's.
>>
>> I would like to use sklearn.ensemble.RandomForestClassifier with 1E8
>> samples of 500 features each. Since that is a large set of feature
>> vectors, I did some trials with smaller subsets (1E4, 1E5, 1E6 samples),
>> and the last one already seems slower than the
>> O(n_samples * n_features * log(n_samples)) complexity given here:
>> http://scikit-learn.org/stable/modules/tree.html#complexity. Since the 1E6
>> samples are taking a long time and I don't know when they will finish, I
>> would like a better way to estimate the ETA, or a way to speed up the
>> training. I am also watching my memory usage and I don't seem to be
>> swapping (29 GB of 48 GB in use right now). The other thing is that I
>> requested n_jobs = -1 so it would use all 24 cores of my machine, but
>> looking at my CPU usage, it doesn't seem to be using any of them...
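>>
>> (Rough, untested sketch of what I mean by estimating the ETA: time the
>> fit on growing subsamples and extrapolate with the documented
>> n_samples * n_features * log(n_samples) cost model. The subset sizes and
>> variable names below are placeholders; X and y stand for the full arrays.)
>>
>> import time
>> import numpy as np
>> from sklearn.ensemble import RandomForestClassifier
>>
>> sizes = [10**4, 10**5, 10**6]
>> times = []
>> for n in sizes:
>>     clf = RandomForestClassifier(n_estimators=3, max_depth=20,
>>                                  criterion='entropy', n_jobs=1)
>>     tic = time.time()
>>     clf.fit(X[:n], y[:n])
>>     times.append(time.time() - tic)
>>
>> # seconds per unit of n*log(n); n_features is fixed, so it drops out
>> cost = [t / (n * np.log(n)) for t, n in zip(times, sizes)]
>> n_target = 10**8
>> eta = cost[-1] * n_target * np.log(n_target)
>> print 'rough ETA for 1E8 samples: %.0f seconds' % eta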
>>
>> So, do you guys have any ideas on:
>> - would 1E8 samples be doable with your implementation of random
>> forests (3 trees, 20 levels deep)?
>> - running this code on a cluster using different IPython engines, or
>> would that require a lot of work?
>> - PCA for dimensionality reduction? (in the paper they didn't use any
>> dimensionality reduction, so I am trying to avoid it)
>> - other implementations that I could use for large datasets?
>>
>> PS: I am very new to this library, but I am already impressed!! It's one
>> of the cleanest and probably most intuitive machine learning libraries out
>> there, with pretty impressive documentation and tutorials. Amazing
>> work!!
>>
>> Thank you very much,
>> Youssef
>>
>>
>> ####################################
>> ####### Here is a code snippet:
>> ####################################
>>
>> from sklearn.datasets import make_classification
>> from sklearn.ensemble import RandomForestClassifier
>> from sklearn.cross_validation import train_test_split
>> from sklearn.preprocessing import StandardScaler
>> import time
>> import numpy as np
>>
>> n_samples = 1000
>> n_features = 500
>>
>> # synthetic data with the same number of features as the real problem
>> X, y = make_classification(n_samples, n_features, n_redundant=0,
>>                            n_informative=2, random_state=1,
>>                            n_clusters_per_class=1)
>> rng = np.random.RandomState(2)
>> X += 2 * rng.uniform(size=X.shape)
>> X = StandardScaler().fit_transform(X)
>>
>> clf = RandomForestClassifier(max_depth=20, n_estimators=3,
>>                              criterion='entropy', n_jobs=-1, verbose=10)
>>
>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
>>
>> tic = time.time()
>> clf.fit(X_train, y_train)
>> score = clf.score(X_test, y_test)
>> print 'Time taken:', time.time() - tic, 'seconds'
>> print 'Test accuracy:', score
>>
>>
>> --
>> Youssef Barhomi, MSc, MEng.
>> Research Software Engineer at the CLPS department
>> Brown University
>> T: +1 (617) 797 9929  | GMT -5:00
>>
>>


-- 
Youssef Barhomi, MSc, MEng.
Research Software Engineer at the CLPS department
Brown University
T: +1 (617) 797 9929  | GMT -5:00
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
