Hi Youssef.
I would strongly advise you to use an image-specific random forest
implementation.
There is a very good implementation by some other MSRC people:
http://research.microsoft.com/en-us/downloads/03e0ca05-8aa9-49f6-801f-bb23846dc147/
It implements a much more complicated model, decision tree fields, but
can also be used for plain random forests.
Cheers,
Andy
On 04/25/2013 03:19 AM, Youssef Barhomi wrote:
Hello,
I am trying to reproduce the results of this paper:
http://research.microsoft.com/pubs/145347/BodyPartRecognition.pdf with
different kinds of data (monkey depth maps instead of humans). So I am
generating my depth features, then training and classifying with a
random forest whose parameters are quite similar to those in the paper.
I would like to use sklearn.ensemble.RandomForestClassifier with 1E8
samples and 500 features. Since that is a large dataset of feature
vectors, I did some trials with smaller subsets (1E4, 1E5, 1E6
samples), and the last one seemed to scale worse than the
O(n_samples*n_features*log(n_samples)) given here:
http://scikit-learn.org/stable/modules/tree.html#complexity
Since 1E6 samples are taking a long time and I don't know when they will
be done, I would like a better way to estimate the ETA or a way to
speed up training. Also, I am watching my memory usage
and I don't seem to be swapping (29GB/48GB being used right now). The
other thing is that I requested n_jobs = -1 so it could use all the cores
of my machine (24 cores), but looking at my CPU usage, it doesn't seem
to be using any of them...
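One rough way to get an ETA is to time a fit on a small subset and scale
it by the documented complexity. The sketch below assumes that the
O(n_samples*n_features*log(n_samples)) cost actually holds and that
n_features stays fixed between runs. The function name and the example
timing are hypothetical and not part of scikit-learn.

```python
import math


def extrapolate_fit_time(n_small, t_small, n_large):
    """Scale a measured fit time to a larger sample count.

    Assumes cost proportional to n * log(n) with n_features held fixed,
    so the n_features factor cancels out of the ratio.
    """
    scale = (n_large * math.log(n_large)) / (n_small * math.log(n_small))
    return t_small * scale


# Hypothetical measurement: 600 seconds for 1E6 samples.
eta = extrapolate_fit_time(10**6, 600.0, 10**8)
print("estimated seconds for 1E8 samples:", eta)
```

If the measured times for 1E4, 1E5 and 1E6 samples grow faster than this
ratio predicts, the estimate should be read as optimistic.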
So, do you guys have any ideas on:
- would 1E8 samples be doable with your implementation of random
forests (3 trees, 20 levels deep)?
- running this code on a cluster using different IPython engines? or
would that require a lot of work?
- PCA for dimensionality reduction? (in the paper, they haven't used
any dim reduction, so I am trying to avoid that)
- other implementations that I could use for large datasets?
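On the first question, a quick size check may help before anything else.
The sketch below only computes the footprint of a dense feature matrix.
The helper name is hypothetical, and the 4-byte and 8-byte value sizes
are assumptions (float32 and float64 storage).

```python
def dense_matrix_gib(n_samples, n_features, bytes_per_value):
    """Return the size in GiB of a dense n_samples x n_features array."""
    return n_samples * n_features * bytes_per_value / 2**30


# Assumed storage types: 4 bytes (float32) and 8 bytes (float64).
print("float32 GiB:", dense_matrix_gib(10**8, 500, 4))
print("float64 GiB:", dense_matrix_gib(10**8, 500, 8))
```

For 1E8 samples and 500 features this comes to roughly 186 GiB in
float32, well above the 48GB mentioned earlier, so the full matrix
would likely not fit in RAM on that machine without subsampling.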
PS: I am very new to this library but I am already impressed!! It's
one of the cleanest and probably most intuitive machine learning
libraries out there with a pretty impressive documentation and
tutorials. Pretty amazing work!!
Thank you very much,
Youssef
####################################
#######Here is a code snippet:
####################################
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
# train_test_split lives in sklearn.model_selection in current releases
# (older releases exposed it as sklearn.cross_validation)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import time
import numpy as np

n_samples = 1000
n_features = 500
X, y = make_classification(n_samples=n_samples, n_features=n_features,
                           n_redundant=0, n_informative=2,
                           random_state=1, n_clusters_per_class=1)
clf = RandomForestClassifier(max_depth=20, n_estimators=3,
                             criterion='entropy', n_jobs=-1, verbose=10)
# add uniform noise, then standardize every feature
rng = np.random.RandomState(2)
X += 2 * rng.uniform(size=X.shape)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
# time training plus scoring on the held-out split
tic = time.time()
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print('Time taken:', time.time() - tic, 'seconds')
--
Youssef Barhomi, MSc, MEng.
Research Software Engineer at the CLPS department
Brown University
T: +1 (617) 797 9929 | GMT -5:00
------------------------------------------------------------------------------
Try New Relic Now & We'll Send You this Cool Shirt
New Relic is the only SaaS-based application performance monitoring service
that delivers powerful full stack analytics. Optimize and monitor your
browser, app, & servers with just a few lines of code. Try New Relic
and get this awesome Nerd Life shirt! http://p.sf.net/sfu/newrelic_d2d_apr
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general