Re: [Scikit-learn-general] Distributed RandomForests

Youssef Barhomi Mon, 29 Apr 2013 07:23:34 -0700

Thank you Andreas!


On Sat, Apr 27, 2013 at 2:03 PM, Andreas Mueller
<amuel...@ais.uni-bonn.de> wrote:

>  Hi Youssef.
> I would strongly advise you to use an image-specific random forest
> implementation.
> There is a very good implementation by some other MSRC people:
>
> http://research.microsoft.com/en-us/downloads/03e0ca05-8aa9-49f6-801f-bb23846dc147/
> It implements a much more complicated model, decision tree fields, but can
> also be used for plain random forests.
>
> Cheers,
> Andy
>
>
> On 04/25/2013 03:19 AM, Youssef Barhomi wrote:
>
>  Hello,
>
>  I am trying to reproduce the results of this paper:
> http://research.microsoft.com/pubs/145347/BodyPartRecognition.pdf with
> a different kind of data (monkey depth maps instead of human ones). I am
> generating my depth features, then training and classifying with a
> random forest whose parameters are quite similar to those in the paper.
>
>  I would like to use sklearn.ensemble.RandomForestClassifier with 1E8
> samples of 500 features each. Since that is a large dataset of feature
> vectors, I ran some trials on smaller subsets (1E4, 1E5, 1E6 samples). The
> last one seems to scale worse than the
> O(n_samples*n_features*log(n_samples)) given here:
> http://scikit-learn.org/stable/modules/tree.html#complexity
> The 1E6-sample run is taking a long time and I don't know when it will
> finish, so I would like a better way to estimate the ETA or a way to speed
> up training. I am also watching my memory usage and I don't seem to be
> swapping (29GB/48GB in use right now). The other thing is that I requested
> n_jobs = -1 so that it would use all the cores of my machine (24 cores),
> but looking at my CPU usage, it doesn't seem to be using any of them...
>
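One rough way to get an ETA for the larger runs, sketched here as an illustration rather than anything provided by scikit-learn: time the fit on a few small subsets, fit a power law t = c * n^k to the timings in log-log space, and extrapolate to the target size. The function names and the timings below are made up for the example. A fitted exponent k well above 1 would suggest the fit is scaling worse than the documented n*log(n) behaviour.

```python
import math


def fit_power_law(sizes, times):
    """Least-squares fit of log(t) = log(c) + k*log(n); returns (c, k)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx
    c = math.exp(mean_y - k * mean_x)
    return c, k


def estimate_seconds(c, k, n_samples):
    """Extrapolate the fitted power law to a new sample count."""
    return c * n_samples ** k


# Hypothetical timings in seconds, NOT measurements from this thread.
sizes = [1e4, 1e5, 1e6]
times = [2.0, 30.0, 450.0]
c, k = fit_power_law(sizes, times)
print("fitted exponent k = %.2f" % k)
print("estimated seconds for 1e8 samples: %.0f" % estimate_seconds(c, k, 1e8))
```

In practice the timings would come from wrapping clf.fit on growing subsets of the data in time.time() calls. Extrapolating two orders of magnitude from three points is only a ballpark figure.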
>  So, do you guys have any ideas on:
>  - would 1E8 samples be doable with your implementation of random
> forests (3 trees, 20 levels deep)?
>  - running this code on a cluster using different IPython engines? Or would
> that require a lot of work?
>  - PCA for dimensionality reduction? (In the paper, they didn't use any
> dimensionality reduction, so I am trying to avoid it.)
>  - other implementations that I could use for large datasets?
>
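As a back-of-the-envelope check on the first question: a dense array of 1E8 samples by 500 features at 4 bytes per value comes to 200 GB, well beyond the 48GB of RAM mentioned above. The 4 bytes assume float32 storage (float64 would double the figure), and the helper name below is hypothetical; it is only the arithmetic written out.

```python
def dense_array_bytes(n_samples, n_features, bytes_per_value=4):
    """Size of a dense n_samples x n_features array.

    bytes_per_value=4 assumes float32; pass 8 for float64.
    """
    return n_samples * n_features * bytes_per_value


# Subset sizes mentioned in the message, plus the full 1E8 target.
for n_samples in (10**4, 10**5, 10**6, 10**8):
    gb = dense_array_bytes(n_samples, 500) / 1e9
    print("%.0e samples x 500 features: %.2f GB" % (n_samples, gb))
```

This counts only the feature matrix itself; any copies made during training would come on top of it.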
>  PS: I am very new to this library but I am already impressed!! It's one
> of the cleanest and probably most intuitive machine learning libraries out
> there, with pretty impressive documentation and tutorials. Pretty amazing
> work!!
>
>  Thank you very much,
>  Youssef
>
>
>  ####################################
>  # Here is a code snippet:
>  ####################################
>
>  import time
>
>  import numpy as np
>  from sklearn.datasets import make_classification
>  from sklearn.ensemble import RandomForestClassifier
>  # train_test_split was in sklearn.cross_validation when this was written;
>  # current releases expose it from sklearn.model_selection.
>  from sklearn.model_selection import train_test_split
>  from sklearn.preprocessing import StandardScaler
>
>  n_samples = 1000
>  n_features = 500
>  X, y = make_classification(n_samples=n_samples, n_features=n_features,
>                             n_redundant=0, n_informative=2,
>                             random_state=1, n_clusters_per_class=1)
>  clf = RandomForestClassifier(max_depth=20, n_estimators=3,
>                               criterion='entropy', n_jobs=-1, verbose=10)
>
>  # Add uniform noise, then standardize each feature.
>  rng = np.random.RandomState(2)
>  X += 2 * rng.uniform(size=X.shape)
>  X = StandardScaler().fit_transform(X)
>  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
>
>  # Time the fit and the scoring together.
>  tic = time.time()
>  clf.fit(X_train, y_train)
>  score = clf.score(X_test, y_test)
>  print('Time taken:', time.time() - tic, 'seconds')
>
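A note on n_jobs = -1 in the snippet above: scikit-learn's random forest parallelizes across trees, so with n_estimators=3 at most three workers can be busy, however many cores the machine has. The helper below only illustrates that bound. Its name and the even split are assumptions made for the example, not scikit-learn's internal code.

```python
def trees_per_worker(n_trees, n_workers):
    """Split n_trees (>= 1) as evenly as possible across workers.

    Workers that would receive no tree are dropped, so the result has
    min(n_trees, n_workers) entries.
    """
    n_active = min(n_trees, n_workers)
    base, extra = divmod(n_trees, n_active)
    return [base + 1 if i < extra else base for i in range(n_active)]


# 3 trees on a 24-core machine, as in the snippet, then a larger forest.
print(trees_per_worker(3, 24))
print(trees_per_worker(100, 24))
```

With only 1000 samples the fit may also finish too quickly for CPU usage to register in a system monitor.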
>
>  --
>  Youssef Barhomi, MSc, MEng.
> Research Software Engineer at the CLPS department
> Brown University
> T: +1 (617) 797 9929  | GMT -5:00
>
>
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>


-- 
Youssef Barhomi, MSc, MEng.
Research Software Engineer at the CLPS department
Brown University
T: +1 (617) 797 9929  | GMT -5:00
