Thank you Peter. I found that the feature extraction was taking a lot of
extra memory, and that was not related to wiseRF, so you were right.
Actually, from "top" it seems the training part was taking only about 20%
more memory than the size of the dataset itself, which is pretty
impressive. So at this point I am pretty much memory bound because of the
dataset size. The only other ways to deal with this would be PCA or a
distributed random forest. The wiseRF people are working on "sequoia",
an RF that should run on the cloud, so I will definitely use that when
it's ready.
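
If I go the PCA route, scikit-learn's IncrementalPCA would let me reduce
the dimensionality without ever holding the full 1E6 x 500 array in
memory - a rough sketch, with the chunk size and component count made up
for illustration (real chunks would be loaded from disk, e.g. via
np.memmap, rather than generated):

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.RandomState(0)
n_features, n_components = 500, 50

# Fit the PCA one chunk at a time; only one chunk is resident in memory.
ipca = IncrementalPCA(n_components=n_components)
for _ in range(10):
    chunk = rng.rand(2000, n_features).astype(np.float32)
    ipca.partial_fit(chunk)

# Project new samples down to n_components dimensions.
X_new = rng.rand(1000, n_features).astype(np.float32)
X_reduced = ipca.transform(X_new)
```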




On Thu, Apr 25, 2013 at 10:17 AM, Peter Prettenhofer <
peter.prettenho...@gmail.com> wrote:

>
>
>
> 2013/4/25 Youssef Barhomi <youssef.barh...@gmail.com>
>
>>
>> Thank you very much, Peter.
>>
>> You are right about n_jobs; something was going wrong with that. When
>> n_jobs = -1 with a larger dataset (1E6 in this case), no CPU was being
>> used and the process hung for a while; setting n_jobs = 1 made
>> everything work.
>> Yes, I will look into IPython parallel and see if I can do that.
>> I have just tried wiseRF and it worked like a charm, with almost the same
>> accuracy as the sklearn RF and a 6x speedup so far. I was able to run a
>> 1E6 x 500 dataset in 45 seconds, with 14GB of RAM being used. I will try
>> rtranking sometime today. Now I am obviously memory bound; would you
>> recommend an online RF library at this point?
>>
>
> 1E6 in 45 seconds - that's really good.
>
> The memory consumption seems a little high though - for 1E6 x 500 I'd
> expect roughly 4GB (assuming you use float64). What's the memory
> consumption right _before_ you call WiseRF.fit? Probably your memory
> consumption peaks during the feature extraction. Make sure you free all
> data structures except the data array - usually, the Python interpreter
> won't hand memory back to the operating system, so the memory consumption
> reported by top will be higher than the memory actually allocated. To
> further reduce memory consumption, make sure that your array has dtype
> np.float32; sklearn assumes float32 and will actually copy a float64
> array to float32, whereas wiseRF does not do this AFAIK. Still, 1E8
> samples won't fit into your 52GB box.
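
A quick way to check the dtype point is ndarray.nbytes, which reports the
size of the actual data buffer rather than what top shows - a minimal
sketch, with the array scaled down from the 1E6 x 500 case:

```python
import numpy as np

# A 10000 x 500 stand-in for the real 1E6 x 500 array.
X64 = np.random.RandomState(0).rand(10000, 500)  # float64 by default
X32 = X64.astype(np.float32)                     # halves the footprint

# nbytes is the allocated buffer size, independent of interpreter
# overhead or memory the process has not returned to the OS.
mb_float64 = X64.nbytes / 1e6   # 8 bytes per value
mb_float32 = X32.nbytes / 1e6   # 4 bytes per value
```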
>
> I don't have much experience with streaming / online RF - please drop me a
> note about your progress here.
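
One streaming option inside sklearn itself, though it is a linear model
rather than a forest: estimators that expose partial_fit, such as
SGDClassifier, can consume the data chunk by chunk - a rough sketch with
made-up chunk sizes and a toy separable target:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # must be declared up front for partial_fit

# Stream 20 chunks of 500 samples; only one chunk is in memory at a time.
for _ in range(20):
    X_chunk = rng.rand(500, 50).astype(np.float32)
    y_chunk = (X_chunk[:, 0] > 0.5).astype(int)  # linearly separable toy label
    clf.partial_fit(X_chunk, y_chunk, classes=classes)

acc = clf.score(X_chunk, y_chunk)  # accuracy on the last chunk
```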
>
>
>
>>
>>
>>> On 25.04.2013 at 03:22, "Youssef Barhomi" <youssef.barh...@gmail.com>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> I am trying to reproduce the results of this paper:
>>>> http://research.microsoft.com/pubs/145347/BodyPartRecognition.pdf with
>>>> different kinds of data (monkey depth maps instead of human ones). So I
>>>> am generating my depth features and training and classifying the data
>>>> with a random forest, using parameters quite similar to the paper's.
>>>>
>>>> I would like to use sklearn.ensemble.RandomForestClassifier with 1E8
>>>> samples of 500 features each. Since this seems to be a large dataset of
>>>> feature vectors, I did some trials with smaller subsets (1E4, 1E5, 1E6
>>>> samples), and the last one seemed to be slower than the
>>>> O(n_samples*n_features*log(n_samples)) complexity given here would
>>>> suggest: http://scikit-learn.org/stable/modules/tree.html#complexity
>>>> Since 1E6 samples are taking a long time and I don't know when they
>>>> will be done, I would like a better way to estimate the ETA, or a way
>>>> to speed up the training. Also, I am watching my memory usage and I
>>>> don't seem to be swapping (29GB/48GB being used right now). The other
>>>> thing is that I requested n_jobs = -1 so it could use all the cores of
>>>> my machine (24 cores), but looking at my CPU usage, it doesn't seem to
>>>> be using any of them...
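
One way to get a rough ETA, assuming the t(n) ~ n_samples * log(n_samples)
scaling from the docs: time the fit on a small subsample and extrapolate.
A sketch with toy sizes (a real run would use subsamples of the actual
data instead of random arrays):

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(20000, 20).astype(np.float32)
y = (X[:, 0] > 0.5).astype(int)

def fit_seconds(n):
    """Time a forest fit on the first n samples."""
    clf = RandomForestClassifier(n_estimators=3, max_depth=20, random_state=0)
    tic = time.time()
    clf.fit(X[:n], y[:n])
    return time.time() - tic

# Extrapolate from the small run using t(n) ~ n * log(n).
n_small, n_full = 5000, 20000
t_small = fit_seconds(n_small)
eta = t_small * (n_full * np.log(n_full)) / (n_small * np.log(n_small))
```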
>>>>
>>>> So, do you guys have any ideas on:
>>>> - would 1E8 samples be doable with your implementation of random
>>>> forests (3 trees, 20 levels deep)?
>>>> - running this code on a cluster using different IPython engines, or
>>>> would that require a lot of work?
>>>> - PCA for dimensionality reduction? (in the paper, they didn't use any
>>>> dim reduction, so I am trying to avoid it)
>>>> - other implementations that I could use for large datasets?
>>>>
>>>> PS: I am very new to this library, but I am already impressed!! It's
>>>> one of the cleanest and probably most intuitive machine learning
>>>> libraries out there, with pretty impressive documentation and
>>>> tutorials. Pretty amazing work!!
>>>>
>>>> Thank you very much,
>>>> Youssef
>>>>
>>>>
>>>> ####################################
>>>> #######Here is a code snippet:
>>>> ####################################
>>>>
>>>> from sklearn.datasets import make_classification
>>>> from sklearn.ensemble import RandomForestClassifier
>>>> from sklearn.cross_validation import train_test_split
>>>> from sklearn.preprocessing import StandardScaler
>>>> import time
>>>> import numpy as np
>>>>
>>>> n_samples = 1000
>>>> n_features = 500
>>>> X, y = make_classification(n_samples, n_features, n_redundant=0,
>>>>                            n_informative=2, random_state=1,
>>>>                            n_clusters_per_class=1)
>>>> clf = RandomForestClassifier(max_depth=20, n_estimators=3,
>>>>                              criterion='entropy', n_jobs=-1, verbose=10)
>>>>
>>>> # add some noise, then standardize the features
>>>> rng = np.random.RandomState(2)
>>>> X += 2 * rng.uniform(size=X.shape)
>>>> X = StandardScaler().fit_transform(X)
>>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
>>>>
>>>> tic = time.time()
>>>> clf.fit(X_train, y_train)
>>>> print 'Time taken:', time.time() - tic, 'seconds'
>>>> print 'Test accuracy:', clf.score(X_test, y_test)
>>>>
>>>>
>>>> --
>>>> Youssef Barhomi, MSc, MEng.
>>>> Research Software Engineer at the CLPS department
>>>> Brown University
>>>> T: +1 (617) 797 9929  | GMT -5:00
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>> Try New Relic Now & We'll Send You this Cool Shirt
>>>> New Relic is the only SaaS-based application performance monitoring
>>>> service
>>>> that delivers powerful full stack analytics. Optimize and monitor your
>>>> browser, app, & servers with just a few lines of code. Try New Relic
>>>> and get this awesome Nerd Life shirt!
>>>> http://p.sf.net/sfu/newrelic_d2d_apr
>>>> _______________________________________________
>>>> Scikit-learn-general mailing list
>>>> Scikit-learn-general@lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>> --
>> Youssef Barhomi, MSc, MEng.
>> Research Software Engineer at the CLPS department
>> Brown University
>> T: +1 (617) 797 9929  | GMT -5:00
>>
>>
>>
>>
>
>
> --
> Peter Prettenhofer
>
>
>
>


-- 
Youssef Barhomi, MSc, MEng.
Research Software Engineer at the CLPS department
Brown University
T: +1 (617) 797 9929  | GMT -5:00