Ohh, that makes total sense now! Thank you, Gilles!
Y

On Thu, Apr 25, 2013 at 2:38 AM, Gilles Louppe <g.lou...@gmail.com> wrote:

> Hi Youssef,
>
> Regarding memory usage, you should know that it will basically blow up
> as you increase the number of jobs. With the current implementation,
> you'll need O(n_jobs * |X| * 2) bytes of memory (where |X| is the size
> of X, in bytes). The issue stems from our use of joblib, which
> basically forces us to duplicate the dataset once per spawned process.
> This also induces a huge overhead in CPU time, because of the
> back-and-forth transfers of all these huge Python objects.
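>
> To put numbers on that, here is a back-of-the-envelope sketch (I'm
> plugging in your figures from below: 1E6 samples x 500 float64
> features and 24 jobs; the factor of 2 is the bound above):
>
> import numpy as np
>
> n_samples, n_features, n_jobs = int(1e6), 500, 24
> bytes_per_value = np.dtype(np.float64).itemsize  # 8 bytes
> size_X = n_samples * n_features * bytes_per_value
> peak = n_jobs * size_X * 2  # the O(n_jobs * |X| * 2) bound
> print('X alone: %.1f GB' % (size_X / 1e9))       # ~4 GB
> print('estimated peak: %.1f GB' % (peak / 1e9))  # ~192 GB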
>
> There is one PR (https://github.com/joblib/joblib/pull/44) that tries
> to solve this by allowing objects to be placed in shared memory
> segments, but it is still a work in progress.
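>
> In the meantime, one manual workaround is to allocate the data in
> shared memory yourself, along these lines (a rough, untested sketch
> using only the standard library; note this only helps when you manage
> the worker processes yourself, since the n_jobs code path will still
> pickle its arguments):
>
> import numpy as np
> from multiprocessing import RawArray
>
> n_samples, n_features = 1000, 500
> shared = RawArray('d', n_samples * n_features)  # 'd' = C double
> X = np.frombuffer(shared).reshape(n_samples, n_features)
> # fill X in place; worker processes forked after this point see the
> # same underlying buffer instead of receiving a pickled copy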
>
> Gilles
>
>
> On 25 April 2013 06:54, Brian Holt <bdho...@gmail.com> wrote:
>
>> Hi Youssef,
>>
>> You're trying to do exactly what I did. The first thing to note is
>> that the Microsoft guys don't precompute the features; rather, they
>> compute them on the fly. That means they only need enough memory to
>> store the depth images, and since they have a 1000-core cluster,
>> computing the features is much less of a problem for them.
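>>
>> If it helps, my recollection of the feature from that paper is a
>> simple depth difference with offsets normalized by the depth at the
>> reference pixel, so it is cheap to compute on the fly. A sketch from
>> memory (not the authors' code; bounds checking omitted):
>>
>> import numpy as np
>>
>> def depth_feature(depth, x, u, v):
>>     # f(I, x) = d(x + u/d(x)) - d(x + v/d(x)); scaling the offsets
>>     # u, v by 1/d(x) makes the feature roughly depth invariant
>>     d = float(depth[x[0], x[1]])
>>     p = np.asarray(x) + np.asarray(u) / d
>>     q = np.asarray(x) + np.asarray(v) / d
>>     return depth[int(p[0]), int(p[1])] - depth[int(q[0]), int(q[1])]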
>>
>> If you profile your program, my guess is that you'll find the
>> bottleneck, as you scale up to 1M dimensions and higher, is the
>> argsorting of all your data. I did some work to argsort a feature
>> only when required, which made the code a bit slower but more
>> tractable. Unfortunately, the code base has changed a lot since
>> then, so my PR is out of date. You're welcome to pick it up and
>> update it for your own work, although I'm not sure it would be
>> accepted upstream.
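>>
>> The gist of what I did, roughly (paraphrased from memory, with
>> made-up names):
>>
>> import numpy as np
>>
>> def presort_all(X):
>>     # eager: one argsort per feature, all held in memory at once --
>>     # this is what blows up as the feature count grows
>>     return np.argsort(X, axis=0)
>>
>> def argsort_feature(X, j):
>>     # lazy: sort a single column only when the split search actually
>>     # reaches feature j; slower overall, but bounded memory
>>     return np.argsort(X[:, j])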
>>
>> I'm sorry I can't be of more help - it's tricky trying to replicate
>> work when you have vastly different tools.
>>
>> Regards
>> Brian
>> On Apr 25, 2013 9:22 AM, "Youssef Barhomi" <youssef.barh...@gmail.com>
>> wrote:
>>
>>> Hello,
>>>
>>> I am trying to reproduce the results of this paper:
>>> http://research.microsoft.com/pubs/145347/BodyPartRecognition.pdf with
>>> a different kind of data (monkey depth maps instead of human ones). I
>>> am generating my depth features and then training and classifying with
>>> a random forest, using parameters quite similar to those in the paper.
>>>
>>> I would like to use sklearn.ensemble.RandomForestClassifier on 1E8
>>> samples with 500 features each. Since that is a large dataset, I ran
>>> some trials on smaller subsets (1E4, 1E5, 1E6 samples), and the last
>>> one already seemed slower than the
>>> O(n_samples * n_features * log(n_samples)) complexity given here:
>>> http://scikit-learn.org/stable/modules/tree.html#complexity. The 1E6
>>> run has been going for a long time and I don't know when it will
>>> finish, so I would like a better way to estimate the ETA, or a way to
>>> speed up the training. I am also watching my memory usage and don't
>>> seem to be swapping (29GB/48GB in use right now). The other thing is
>>> that I passed n_jobs = -1 so it would use all 24 cores of my machine,
>>> but judging from my CPU usage, it doesn't seem to be using any of
>>> them...
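>>>
>>> To make the ETA question concrete, this is the kind of extrapolation
>>> I had in mind (assuming the O(n_samples * n_features * log(n_samples))
>>> complexity from the docs; eta() is just my own helper name, and the
>>> 600 seconds below is a made-up example figure):
>>>
>>> import numpy as np
>>>
>>> def eta(n_target, n_ref, t_ref):
>>>     # scale a measured fit time t_ref (seconds) at n_ref samples up
>>>     # to n_target samples, for a fixed number of features
>>>     scale = (n_target * np.log(n_target)) / (n_ref * np.log(n_ref))
>>>     return t_ref * scale
>>>
>>> # e.g. if 1E5 samples took 600 seconds, a crude guess for 1E8:
>>> print('ETA: %.0f hours' % (eta(1e8, 1e5, 600.0) / 3600))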
>>>
>>> So, do you guys have any ideas on:
>>> - would 1E8 samples be doable with your implementation of random
>>> forests (3 trees, 20 levels deep)?
>>> - running this code on a cluster using different IPython engines, or
>>> would that require a lot of work? (one naive idea: train a few small
>>> forests independently and merge their trees; see the sketch after
>>> this list)
>>> - PCA for dimensionality reduction? (in the paper they didn't use any
>>> dimensionality reduction, so I am trying to avoid it)
>>> - other implementations that I could use for large datasets?
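>>>
>>> For the cluster question above, the sketch I mentioned: train small
>>> forests independently (e.g. one per engine), then pool their trees.
>>> This leans on the estimators_ attribute, so I have no idea whether
>>> it's a supported pattern:
>>>
>>> def merge_forests(forests):
>>>     # pool the fitted trees of several forests into the first one
>>>     merged = forests[0]
>>>     for f in forests[1:]:
>>>         merged.estimators_ += f.estimators_
>>>     merged.n_estimators = len(merged.estimators_)
>>>     return merged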
>>>
>>> PS: I am very new to this library, but I am already impressed! It's
>>> one of the cleanest and probably most intuitive machine learning
>>> libraries out there, with impressive documentation and tutorials.
>>> Amazing work!
>>>
>>> Thank you very much,
>>> Youssef
>>>
>>>
>>> ####################################
>>> # Here is a code snippet:
>>> ####################################
>>>
>>> from sklearn.datasets import make_classification
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.cross_validation import train_test_split
>>> from sklearn.preprocessing import StandardScaler
>>> import time
>>> import numpy as np
>>>
>>> n_samples = 1000
>>> n_features = 500
>>>
>>> # build a toy classification problem
>>> X, y = make_classification(n_samples, n_features, n_redundant=0,
>>>                            n_informative=2, random_state=1,
>>>                            n_clusters_per_class=1)
>>>
>>> # add some uniform noise, then standardize the features
>>> rng = np.random.RandomState(2)
>>> X += 2 * rng.uniform(size=X.shape)
>>> X = StandardScaler().fit_transform(X)
>>>
>>> clf = RandomForestClassifier(max_depth=20, n_estimators=3,
>>>                              criterion='entropy', n_jobs=-1,
>>>                              verbose=10)
>>>
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
>>> tic = time.time()
>>> clf.fit(X_train, y_train)
>>> score = clf.score(X_test, y_test)
>>> print('Time taken: %.1f seconds' % (time.time() - tic))
>>> print('Test accuracy: %.3f' % score)
>>>
>>>
>>> --
>>> Youssef Barhomi, MSc, MEng.
>>> Research Software Engineer at the CLPS department
>>> Brown University
>>> T: +1 (617) 797 9929  | GMT -5:00
>>>
>>>
>>>
>>
>>
>
>
>


-- 
Youssef Barhomi, MSc, MEng.
Research Software Engineer at the CLPS department
Brown University
T: +1 (617) 797 9929  | GMT -5:00
