2013/4/25 Youssef Barhomi <youssef.barh...@gmail.com>

>
> Thank you very much Peter,
>
> You are right about n_jobs; something was going wrong with that. With
> n_jobs = -1 on the larger dataset (1E6 in this case), no CPU was being used
> and the process hung for a while. Setting n_jobs = 1 made everything
> work.
> Yes, I will look into IPython parallel and see if I can do that.
> I have just tried wiseRF and it worked like a charm, with almost the same
> accuracy as the RF in sklearn and a 6x speedup so far. I was able to run a
> 1E6 x 500 dataset in 45 seconds with 14GB of RAM used. I will try
> rtranking sometime today. Now I am obviously memory bound; would you
> recommend an online RF library at this point?
>

1E6 in 45 seconds - that's really good

The memory consumption seems a little high though - for 1E6 x 500 I'd
expect roughly 4GB (assuming float64) - what's the memory consumption
right _before_ you call WiseRF.fit? Your memory consumption probably
peaks during feature extraction. Make sure you free all data structures
except the data array - usually, the Python interpreter won't hand memory
back to the operating system, so the memory consumption reported by top
will be higher than what is actually allocated. To further reduce memory
consumption, make sure your array has dtype np.float32; sklearn's trees
work on float32 internally and will copy a float64 array to float32,
whereas wiseRF does not do this AFAIK. Still, 1E8 won't fit into your
48GB box.
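
A quick sketch of the arithmetic (plain numpy byte counting, nothing
library-specific):

import numpy as np

n_samples, n_features = int(1e6), 500
print n_samples * n_features * 8 / 1e9, 'GB as float64'   # ~4.0
print n_samples * n_features * 4 / 1e9, 'GB as float32'   # ~2.0

# casting once up front avoids the extra float64 -> float32 copy in fit:
# X = X.astype(np.float32)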

I don't have much experience with streaming / online RF - please drop me a
note about your progress here.



>
>
>> On 25.04.2013 03:22, "Youssef Barhomi" <youssef.barh...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I am trying to reproduce the results of this paper:
>>> http://research.microsoft.com/pubs/145347/BodyPartRecognition.pdf with a
>>> different kind of data (monkey depth maps instead of humans). So I am
>>> generating my depth features and training and classifying the data with
>>> a random forest, using parameters quite similar to those in the paper.
>>>
>>> I would like to use sklearn.ensemble.RandomForestClassifier with 1E8
>>> samples of 500 features each. Since this is a large dataset of feature
>>> vectors, I did some trials with smaller subsets (1E4, 1E5, 1E6 samples),
>>> and the last one seemed slower than the
>>> O(n_samples*n_features*log(n_samples)) scaling given here:
>>> http://scikit-learn.org/stable/modules/tree.html#complexity. Since the
>>> 1E6 samples are taking a long time and I don't know when they will be
>>> done, I would like a better way to estimate the ETA, or a way to speed
>>> up the training. Also, I am watching my memory usage and I don't seem
>>> to be swapping (29GB/48GB used right now). The other thing is that I
>>> requested n_jobs = -1 so it would use all cores of my machine (24
>>> cores), but looking at my CPU usage, it doesn't seem to be using any of
>>> them...
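>>>
>>> For the ETA, here is the rough extrapolation I have in mind (just a
>>> sketch, assuming the docs' n*log(n) scaling holds and n_features stays
>>> fixed; the timing values are placeholders to be replaced with measured
>>> ones):
>>>
>>> import numpy as np
>>>
>>> sizes = np.array([1e4, 1e5])      # subset sizes already benchmarked
>>> times = np.array([12.0, 180.0])   # placeholder fit times in seconds
>>>
>>> # solve t = c * n * log(n) for the constant c, then extrapolate
>>> c = (times / (sizes * np.log(sizes))).mean()
>>> for n in (1e6, 1e8):
>>>     eta = c * n * np.log(n)
>>>     print '%.0e samples -> ~%.1f hours' % (n, eta / 3600)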
>>>
>>> So, do you guys have any ideas on:
>>> - would 1E8 samples be doable with your implementation of random
>>> forests (3 trees, 20 levels deep)?
>>> - running this code on a cluster using several IPython engines (a rough
>>> sketch of what I mean is below)? or would that require a lot of work?
>>> - PCA for dimensionality reduction? (in the paper they didn't use any
>>> dim reduction, so I am trying to avoid that)
>>> - other implementations that I could use for large datasets?
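>>>
>>> For the cluster question, this is roughly what I have in mind (an
>>> untested sketch using the X, y from the snippet below; the
>>> IPython.parallel usage and the chunking are my own assumptions, and
>>> averaging predict_proba is just a poor man's way to combine forests):
>>>
>>> from IPython.parallel import Client
>>>
>>> def fit_chunk(chunk):
>>>     # each engine fits one tree on its own disjoint slice of the data
>>>     from sklearn.ensemble import RandomForestClassifier
>>>     X_part, y_part = chunk
>>>     clf = RandomForestClassifier(n_estimators=1, max_depth=20,
>>>                                  criterion='entropy')
>>>     return clf.fit(X_part, y_part)
>>>
>>> rc = Client()                    # assumes engines are already running
>>> view = rc.load_balanced_view()
>>> chunks = [(X[i::3], y[i::3]) for i in range(3)]  # 3 slices -> 3 trees
>>> forests = view.map_sync(fit_chunk, chunks)
>>>
>>> # combine by averaging class probabilities:
>>> # proba = sum(f.predict_proba(X_test) for f in forests) / 3.0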
>>>
>>> PS: I am very new to this library but I am already impressed!! It's one
>>> of the cleanest and probably most intuitive machine learning libraries
>>> out there, with impressive documentation and tutorials. Amazing work!!
>>>
>>> Thank you very much,
>>> Youssef
>>>
>>>
>>> ####################################
>>> ####### Here is a code snippet:
>>> ####################################
>>>
>>> from sklearn.datasets import make_classification
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.cross_validation import train_test_split
>>> from sklearn.preprocessing import StandardScaler
>>> import time
>>> import numpy as np
>>>
>>> n_samples = 1000
>>> n_features = 500
>>>
>>> # synthetic classification problem with 2 informative features
>>> X, y = make_classification(n_samples, n_features, n_redundant=0,
>>>                            n_informative=2, random_state=1,
>>>                            n_clusters_per_class=1)
>>> rng = np.random.RandomState(2)
>>> X += 2 * rng.uniform(size=X.shape)
>>> X = StandardScaler().fit_transform(X)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
>>>
>>> clf = RandomForestClassifier(max_depth=20, n_estimators=3,
>>>                              criterion='entropy', n_jobs=-1, verbose=10)
>>> tic = time.time()
>>> clf.fit(X_train, y_train)
>>> print 'Training time:', time.time() - tic, 'seconds'
>>> print 'Test accuracy:', clf.score(X_test, y_test)
>>>
>>>
>>> --
>>> Youssef Barhomi, MSc, MEng.
>>> Research Software Engineer at the CLPS department
>>> Brown University
>>> T: +1 (617) 797 9929  | GMT -5:00
>> ------------------------------------------------------------------------------
>> Try New Relic Now & We'll Send You this Cool Shirt
>> New Relic is the only SaaS-based application performance monitoring
>> service
>> that delivers powerful full stack analytics. Optimize and monitor your
>> browser, app, & servers with just a few lines of code. Try New Relic
>> and get this awesome Nerd Life shirt!
>> http://p.sf.net/sfu/newrelic_d2d_apr
>> _______________________________________________
>> Scikit-learn-general mailing list
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>>
>
>
> --
> Youssef Barhomi, MSc, MEng.
> Research Software Engineer at the CLPS department
> Brown University
> T: +1 (617) 797 9929  | GMT -5:00


-- 
Peter Prettenhofer
