I've tried larger data sets. It wasn't pretty, though those had far fewer features.
On Apr 25, 2013 4:03 AM, "Peter Prettenhofer" <peter.prettenho...@gmail.com>
wrote:
> Hi Youssef,
>
> please make sure that you use the latest version of sklearn (>= 0.13) - we
> did some enhancements to the sub-sampling procedure lately.
>
> Looking at the RandomForest code, it seems that ``n_jobs=-1`` should not
> be the issue for the parallel training of the trees, since ``n_jobs =
> min(cpu_count(), self.n_estimators)``, which should be just 3 in your
> case. However, it will use cpu_count() processes to sort the feature
> values, so the bottleneck might be there. Please try setting the n_jobs
> parameter to a smaller constant (e.g. 4) and check whether it works better.
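> For instance (a minimal sketch on a tiny synthetic problem, just to
> illustrate the parameter change -- the dataset sizes here are placeholders,
> not your real data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small stand-in dataset; the real one would be far larger.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Cap the number of worker processes at a small constant instead of
# n_jobs=-1, so the parallel feature-sorting step does not spawn one
# process per core on a 24-core machine.
clf = RandomForestClassifier(n_estimators=3, max_depth=20, n_jobs=4)
clf.fit(X, y)
```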
>
> Having said that: 1E8 samples is pretty large. The largest dataset I've
> used so far was merely 1E6, but I've heard of people applying it to larger
> datasets too (probably not 1E8, though).
>
> Running the code on a cluster using IPython parallel should not be too
> hard - RF is a pretty simple algorithm. You could either patch the
> existing code to use IPython parallel instead of joblib.Parallel (see
> forest.py) or simply write your own RF code that directly uses
> ``DecisionTreeClassifier``. Also, you can likely skip bootstrapping - it
> doesn't help much IMHO and can make the implementation a bit more
> "involved" - AFAIK the MSR guys didn't use bootstrapping for their Kinect
> RF system...
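> As a rough sketch of what "your own RF without bootstrapping" could look
> like (illustrative only: a handful of trees on toy data, with per-tree
> feature subsampling via max_features supplying the randomness instead of
> bootstrap resampling -- the sizes and parameters are made up):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for the real depth features.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

n_trees = 3
trees = []
for seed in range(n_trees):
    # No bootstrap resampling: every tree sees the full sample and
    # differs only through random feature selection at each split.
    tree = DecisionTreeClassifier(max_depth=20, max_features='sqrt',
                                  random_state=seed)
    trees.append(tree.fit(X, y))

# Majority vote over the individual tree predictions.
votes = np.stack([t.predict(X) for t in trees])
y_pred = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```

> Each tree fit is independent, so the loop body is exactly the unit of work
> you would ship to a separate IPython engine.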
>
> When it comes to other implementations you could look at rt-rank [1],
> which is a parallel implementation of both GBRT and RF; and WiseRF [2],
> which is compatible with sklearn but you have to obtain a license (free
> trial and academic version AFAIK).
>
> HTH,
>
> Peter
>
> [1] https://sites.google.com/site/rtranking/
>
> [2] http://about.wise.io/
>
>
> On 25.04.2013 03:22, "Youssef Barhomi" <youssef.barh...@gmail.com> wrote:
>
>> Hello,
>>
>> I am trying to reproduce the results of this paper:
>> http://research.microsoft.com/pubs/145347/BodyPartRecognition.pdf with
>> different kinds of data (monkey depth maps instead of humans). So I am
>> generating my depth features and training and classifying the data with a
>> random forest, using parameters quite similar to the paper's.
>>
>> I would like to use sklearn.ensemble.RandomForestClassifier with 1E8
>> samples of 500 features each. Since that is a large dataset of feature
>> vectors, I did some trials with smaller subsets (1E4, 1E5, 1E6 samples),
>> and the last one seemed slower than the
>> O(n_samples*n_features*log(n_samples)) complexity given here:
>> http://scikit-learn.org/stable/modules/tree.html#complexity. Since the
>> 1E6 samples are taking a long time and I don't know when they will be
>> done, I would like a better way to estimate the ETA, or a way to speed up
>> the training. Also, I am watching my memory usage and I don't seem to be
>> swapping (29GB/48GB in use right now). The other thing is that I
>> requested n_jobs = -1 so it would use all cores of my machine (24 cores),
>> but looking at my CPU usage, it doesn't seem to be using any of them...
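>> (As a rough sketch of what I mean by estimating the ETA: time fits on
>> growing subsets and extrapolate with the documented
>> n_samples*log(n_samples) cost model -- the subset sizes and feature count
>> below are placeholders:)

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Time small fits, then project to the target size with the
# O(n_samples * n_features * log(n_samples)) model from the docs.
sizes = [1000, 2000, 4000]
times = []
for n in sizes:
    X, y = make_classification(n_samples=n, n_features=50, random_state=0)
    clf = RandomForestClassifier(n_estimators=3, max_depth=20)
    tic = time.time()
    clf.fit(X, y)
    times.append(time.time() - tic)

# Average constant c in t ~ c * n * log(n), then extrapolate to 1e8.
ns = np.array(sizes, dtype=float)
c = np.mean(np.array(times) / (ns * np.log(ns)))
eta = c * 1e8 * np.log(1e8)
print('rough ETA for 1e8 samples: %.0f seconds' % eta)
```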
>>
>> So, do you guys have any ideas on:
>> - would 1E8 samples be doable with your implementation of random
>> forests (3 trees, 20 levels deep)?
>> - would running this code on a cluster using different IPython engines
>> require a lot of work?
>> - PCA for dimensionality reduction? (in the paper they didn't use any
>> dim reduction, so I am trying to avoid it)
>> - are there other implementations I could use for large datasets?
>>
>> PS: I am very new to this library but I am already impressed!! It's one
>> of the cleanest and probably most intuitive machine learning libraries out
>> there with a pretty impressive documentation and tutorials. Pretty amazing
>> work!!
>>
>> Thank you very much,
>> Youssef
>>
>>
>> ####################################
>> #######Here is a code snippet:
>> ####################################
>>
>> from sklearn.datasets import make_classification
>> from sklearn.ensemble import RandomForestClassifier
>> from sklearn.cross_validation import train_test_split
>> from sklearn.preprocessing import StandardScaler
>> import time
>> import numpy as np
>>
>> n_samples = 1000
>> n_features = 500
>> X, y = make_classification(n_samples, n_features, n_redundant=0,
>>                            n_informative=2, random_state=1,
>>                            n_clusters_per_class=1)
>> clf = RandomForestClassifier(max_depth=20, n_estimators=3,
>>                              criterion='entropy', n_jobs=-1, verbose=10)
>>
>> rng = np.random.RandomState(2)
>> X += 2 * rng.uniform(size=X.shape)
>> X = StandardScaler().fit_transform(X)
>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
>>
>> tic = time.time()
>> clf.fit(X_train, y_train)
>> score = clf.score(X_test, y_test)
>> print 'Time taken:', time.time() - tic, 'seconds'
>>
>>
>> --
>> Youssef Barhomi, MSc, MEng.
>> Research Software Engineer at the CLPS department
>> Brown University
>> T: +1 (617) 797 9929 | GMT -5:00
>>
>>
>> ------------------------------------------------------------------------------
>> Try New Relic Now & We'll Send You this Cool Shirt
>> New Relic is the only SaaS-based application performance monitoring
>> service
>> that delivers powerful full stack analytics. Optimize and monitor your
>> browser, app, & servers with just a few lines of code. Try New Relic
>> and get this awesome Nerd Life shirt!
>> http://p.sf.net/sfu/newrelic_d2d_apr
>> _______________________________________________
>> Scikit-learn-general mailing list
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>>
>