Hi Brian,
Thanks for your feedback. Were you able to reproduce their results? How
big was the dataset you have processed so far with an RF?

The Microsoft people used a distributed RF, so yes, I am guessing the
features were computed in parallel across all those cores. I am still new
to the RF algorithm, though, and I wonder how they parallelised it: by
sending each tree (or each tree node?) to a separate core? I also think
they implemented a GPU version of their RF (I am guessing that is what
actually runs on the Xbox itself right now), which should probably speed
things up. The other option I am considering is an online RF; any
recommendations on that?
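
To check my understanding of the per-tree option, here is roughly what I
have in mind (just a sketch with joblib on a single machine, not what
Microsoft did; fit_one_tree and the toy data are my own placeholders):

import numpy as np
from joblib import Parallel, delayed
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def fit_one_tree(X, y, seed):
    # each tree is grown on its own bootstrap sample, independently of
    # the others, so whole trees can be trained on separate cores
    rng = np.random.RandomState(seed)
    idx = rng.randint(0, X.shape[0], X.shape[0])
    tree = DecisionTreeClassifier(criterion='entropy', max_depth=20)
    return tree.fit(X[idx], y[idx])

X, y = make_classification(1000, 20)
trees = Parallel(n_jobs=-1)(
    delayed(fit_one_tree)(X, y, seed) for seed in range(3))

Per-node parallelism would need much finer-grained communication between
workers, which I am guessing is why per-tree is the usual split.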
Thanks a lot!


Y



On Thu, Apr 25, 2013 at 12:54 AM, Brian Holt <bdho...@gmail.com> wrote:

> Hi Youssef,
>
> You're trying to do exactly what I did. The first thing to note is that
> the Microsoft guys don't precompute the features; rather, they compute
> them on the fly. That means they only need enough memory to store the
> depth images, and since they have a 1000-core cluster, computing the
> features is much less of a problem for them.
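>
> To make that concrete, computing a feature on the fly could look
> something like this (a toy illustration of the paper's depth-difference
> features, not the Microsoft code; u and v are the pixel offsets from
> their paper):
>
> import numpy as np
>
> def depth_feature(depth, x, u, v):
>     # offsets are scaled by 1/depth(x) so the feature is roughly
>     # depth invariant, as described in the paper
>     d = depth[x[0], x[1]]
>     p = (x + u / d).astype(int)
>     q = (x + v / d).astype(int)
>     return depth[p[0], p[1]] - depth[q[0], q[1]]
>
> depth = 1.0 + np.random.rand(480, 640)  # fake depth image
> f = depth_feature(depth, np.array([240, 320]),
>                   np.array([50.0, 0.0]), np.array([0.0, 50.0]))
>
> Only the depth images have to stay in memory; each feature value exists
> only while a split is being evaluated.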
>
> If you profile your program, my guess is that you'll find that the
> bottleneck, as you scale up to 1M dimensions and higher, is the
> argsorting of all your data. I did some work to argsort a feature only
> when required, which made it a bit slower but more tractable.
> Unfortunately the code base has changed a lot since I did that, so my PR
> is out of date. You're welcome to pick it up and update it for your own
> work, although I'm not sure it would be accepted upstream.
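>
> The idea was roughly this (a toy sketch, not the actual PR code):
>
> import numpy as np
>
> class LazyArgsort(object):
>     # argsort one feature column only when a split first considers
>     # it, instead of argsorting all of X up front
>     def __init__(self, X):
>         self.X = X
>         self.cache = {}
>
>     def __call__(self, j):
>         if j not in self.cache:
>             self.cache[j] = np.argsort(self.X[:, j])
>         return self.cache[j]
>
> Argsorting everything up front costs n_features * n_samples *
> log(n_samples) time, plus an n_samples x n_features integer array of
> memory, before the first split is even evaluated; sorting lazily spreads
> that cost over only the features the trees actually look at.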
>
> I'm sorry I can't be of more help - it's tricky trying to replicate work
> when you have vastly different tools.
>
> Regards
> Brian
> On Apr 25, 2013 9:22 AM, "Youssef Barhomi" <youssef.barh...@gmail.com>
> wrote:
>
>> Hello,
>>
>> I am trying to reproduce the results of this paper:
>> http://research.microsoft.com/pubs/145347/BodyPartRecognition.pdf with
>> a different kind of data (monkey depth maps instead of human ones). So
>> I am generating my depth features and training and classifying the
>> data with a random forest, using parameters quite similar to those in
>> the paper.
>>
>> I would like to use sklearn.ensemble.RandomForestClassifier on 1E8
>> samples, each with 500 features. Since that is a large dataset of
>> feature vectors, I did some trials with smaller subsets (1E4, 1E5, 1E6
>> samples), and the last one seemed to scale worse than the
>> O(n_samples * n_features * log(n_samples)) given here:
>> http://scikit-learn.org/stable/modules/tree.html#complexity. The 1E6
>> run has been taking a long time and I don't know when it will finish,
>> so I would like a better way to estimate the ETA, or a way to speed up
>> the training. I am also watching my memory usage and I don't seem to
>> be swapping (29GB/48GB in use right now). The other thing is that I
>> requested n_jobs=-1 so that it would use all 24 cores of my machine,
>> but looking at my CPU usage, it doesn't seem to be using any of
>> them...
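>>
>> For the ETA, the best I have come up with is to fit the constant in
>> that complexity formula on the smaller runs and extrapolate (a
>> back-of-the-envelope sketch; the timings below are placeholders, not
>> my actual measurements):
>>
>> import numpy as np
>>
>> n_features = 500
>> # (n_samples, seconds) pairs from smaller trial runs -- placeholders
>> runs = [(1e4, 60.0), (1e5, 900.0)]
>> # assume t = c * n * n_features * log2(n) and solve for c on each run
>> cs = [t / (n * n_features * np.log2(n)) for (n, t) in runs]
>> c = np.mean(cs)
>> n = 1e8
>> eta = c * n * n_features * np.log2(n)
>> print 'extrapolated training time: %.1f hours' % (eta / 3600.0)
>>
>> If the real scaling is worse than n*log(n), as my 1E6 run suggests,
>> this will underestimate, but at least it gives a lower bound.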
>>
>> So, do you guys have any ideas on:
>> - would 1E8 samples be doable with your implementation of random
>> forests (3 trees, 20 levels deep)?
>> - could I run this code on a cluster using different IPython engines
>> (see the sketch below), or would that require a lot of work?
>> - PCA for dimensionality reduction? (in the paper they didn't use any
>> dimensionality reduction, so I am trying to avoid it)
>> - are there other implementations I could use for large datasets?
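>>
>> (For the IPython engines option, this is the kind of thing I have in
>> mind -- an untested sketch, one tree per engine; getting X and y onto
>> the engines is the part I am least sure about:)
>>
>> from IPython.parallel import Client
>>
>> def fit_tree(seed):
>>     # assumes X and y already exist on each engine, e.g. loaded from
>>     # a shared filesystem or pushed there with a DirectView
>>     import numpy as np
>>     from sklearn.tree import DecisionTreeClassifier
>>     rng = np.random.RandomState(seed)
>>     idx = rng.randint(0, X.shape[0], X.shape[0])  # bootstrap sample
>>     tree = DecisionTreeClassifier(criterion='entropy', max_depth=20)
>>     return tree.fit(X[idx], y[idx])
>>
>> rc = Client()                      # connect to a running ipcluster
>> view = rc.load_balanced_view()
>> trees = view.map_sync(fit_tree, range(3))  # 3 trees, as in the paper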
>>
>> PS: I am very new to this library, but I am already impressed!! It's
>> one of the cleanest and probably most intuitive machine learning
>> libraries out there, with pretty impressive documentation and
>> tutorials. Pretty amazing work!!
>>
>> Thank you very much,
>> Youssef
>>
>>
>> ####################################
>> #######Here is a code snippet:
>> ####################################
>>
>> from sklearn.datasets import make_classification
>> from sklearn.ensemble import RandomForestClassifier
>> from sklearn.cross_validation import train_test_split
>> from sklearn.preprocessing import StandardScaler
>> import time
>> import numpy as np
>>
>> n_samples = 1000
>> n_features = 500
>>
>> # synthetic stand-in for the real monkey depth-feature data
>> X, y = make_classification(n_samples, n_features, n_redundant=0,
>>                            n_informative=2, random_state=1,
>>                            n_clusters_per_class=1)
>> rng = np.random.RandomState(2)
>> X += 2 * rng.uniform(size=X.shape)
>> X = StandardScaler().fit_transform(X)
>>
>> # same forest shape as the paper: 3 trees, 20 levels deep
>> clf = RandomForestClassifier(max_depth=20, n_estimators=3,
>>                              criterion='entropy', n_jobs=-1, verbose=10)
>>
>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
>> tic = time.time()
>> clf.fit(X_train, y_train)
>> score = clf.score(X_test, y_test)
>> print 'Time taken:', time.time() - tic, 'seconds'
>> print 'Test accuracy:', score
>>
>>
>> --
>> Youssef Barhomi, MSc, MEng.
>> Research Software Engineer at the CLPS department
>> Brown University
>> T: +1 (617) 797 9929  | GMT -5:00
>>
>


-- 
Youssef Barhomi, MSc, MEng.
Research Software Engineer at the CLPS department
Brown University
T: +1 (617) 797 9929  | GMT -5:00
