Hi, I confirm what has been said before. Samples are not stored anywhere in the leaves -- only the final prediction along with some statistics. To do what you want, you have to recompute the distribution yourself, e.g. by calling apply on the training data and then grouping the samples by leaf id.
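
A minimal sketch of that recomputation, assuming scikit-learn and NumPy are installed. The synthetic data and the helper name group_training_samples_by_leaf are made up for illustration. bootstrap=False is also an assumption, chosen so that every tree sees every training sample; with the default bootstrap=True each tree is fit on a resampled subset, so apply on the full training set would also route samples that the tree never saw during training.

```python
from collections import defaultdict

import numpy as np
from sklearn.ensemble import RandomForestRegressor


def group_training_samples_by_leaf(forest, X_train):
    """Map (tree index, leaf id) -> list of training-sample indices.

    Hypothetical helper for illustration, not part of scikit-learn.
    """
    # apply() returns an array of shape (n_samples, n_estimators) holding,
    # for each sample and each tree, the id of the leaf the sample reaches.
    leaf_ids = forest.apply(X_train)
    groups = defaultdict(list)
    for tree_idx in range(leaf_ids.shape[1]):
        for sample_idx, leaf in enumerate(leaf_ids[:, tree_idx]):
            groups[tree_idx, int(leaf)].append(sample_idx)
    return groups


# Synthetic data, for illustration only.
rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = X[:, 0] + 0.1 * rng.randn(200)

# bootstrap=False is an assumption made here so that each tree is trained
# on the full training set.
forest = RandomForestRegressor(
    n_estimators=10, min_samples_leaf=5, bootstrap=False,
    max_features=2, random_state=0,
).fit(X, y)

samples_by_leaf = group_training_samples_by_leaf(forest, X)

# Targets of the training samples in one leaf of the first tree; these are
# the values a per-leaf density estimate would be built from.
tree_idx, leaf = next(iter(samples_by_leaf))
leaf_targets = y[samples_by_leaf[tree_idx, leaf]]
```

The resulting leaf_targets array (or the corresponding rows of X) can then be passed to whatever density estimator you prefer.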
Gilles

On 15 October 2014 02:25, Joel Nothman <joel.noth...@gmail.com> wrote:
> If what you need is "the samples that ended up at each leaf node during
> training", is this not something like:
>
> from collections import defaultdict
> # indices is assumed to be the output of apply() on the training data
> samples_by_node = defaultdict(list)
> for est_ind, est_data in enumerate(indices.T):
>     for sample_ind, leaf in enumerate(est_data):
>         samples_by_node[est_ind, leaf].append(sample_ind)
>
> ?
>
> On 15 October 2014 09:59, M Asad <masad....@gmail.com> wrote:
>>
>> I am not sure if there is already a method to get this but I have read
>> the docs and there doesn't seem to be any. Please correct me if I am wrong.
>>
>> Actually I am trying to get the probability distribution at each leaf node,
>> as done in the book "Decision Forests for Computer Vision and Medical Image
>> Analysis", for which I need the samples that ended up at each leaf node
>> during training. Then I will use kernel density estimation to get a
>> continuous probability distribution at each leaf node. I have done this in
>> my own implementation in C++/OpenCV, however when using scikit all I need
>> are those particular samples at the leaf node.
>>
>> For prediction, I have used apply() to get the index of the predicted leaf.
>> forestReg.estimators_[i].tree_.value[j] returns only one prediction value,
>> however if I call forestReg.estimators_[i].tree_.n_node_samples[j] I get
>> a number of samples greater than min_samples_leaf (which I have set to 5
>> at the moment).
>> Here j is the index of a leaf node within the tree with index i.
>>
>> If it helps here is the code I am using:
>>
>> import numpy as np
>> from sklearn.ensemble import RandomForestRegressor
>>
>> # readMatFromFile is a user-defined helper (not shown)
>> # read the training data
>> trainingLabels = readMatFromFile('dataSet//trainingLabelsSim.dat').T
>> trainingData = readMatFromFile('dataSet//trainingDataSim.dat').T
>>
>> # read the testing data
>> testingLabels = readMatFromFile('dataSet//testingLabelsSim.dat').T
>> testingData = readMatFromFile('dataSet//testingDataSim.dat').T
>>
>> forestClf = RandomForestRegressor(n_estimators=100, min_samples_leaf=5,
>>                                   random_state=0, max_depth=20,
>>                                   max_features=10, verbose=1)
>>
>> forestClf.fit(trainingData, trainingLabels)
>>
>> # leaf index reached by each test sample in each tree
>> index = forestClf.apply(testingData)
>> leafVals = np.zeros(index.shape)
>> for j in range(index.shape[0]):
>>     for i in range(index.shape[1]):
>>         leafVals[j, i] = forestClf.estimators_[i].tree_.value[index[j, i]]
>>
>> Many thanks in advance
>> Muhammad
>>
>>> Date: Wed, 15 Oct 2014 07:59:09 +1100
>>> From: Joel Nothman <joel.noth...@gmail.com>
>>> Subject: Re: [Scikit-learn-general] Access data arriving at leaf nodes
>>> To: scikit-learn-general <scikit-learn-general@lists.sourceforge.net>
>>> Message-ID: <CAAkaFLUB_ApLWGosUovxfEoEi34bcw-ePke0TBCKF3NrQpF=u...@mail.gmail.com>
>>> Content-Type: text/plain; charset="utf-8"
>>>
>>> What do you mean by all the values that make up a leaf node? If you mean
>>> all the samples, isn't apply sufficient?
>>>
>>> On 15 October 2014 06:20, M Asad <masad....@gmail.com> wrote:
>>>
>>> > Hi,
>>> >
>>> > I am kind of new to scikit, however I have learned a lot of things
>>> > now.
>>> >
>>> > I am using sklearn.ensemble.RandomForestRegressor to train on some data
>>> > and predict using some input samples later.
>>> > What I am trying to do now is to access the actual values that make up
>>> > each leaf node.
>>> >
>>> > I have managed to get the index of each leaf node used for prediction
>>> > by using the apply() function.
>>> > And I can also access the prediction value by calling
>>> > forestReg.estimators_[i].tree_.value[j] where i is the tree index and j
>>> > is the index of the leaf node.
>>> >
>>> > Does anyone have any idea how I can get all the values that make up a
>>> > leaf node? I have set min_samples_leaf = 5 so each leaf node comprises
>>> > at least 5 samples.
>>> >
>>> > Many thanks!
>>> >
>>> > Best regards,
>>> > Muhammad Asad
>>> >
>>> > ------------------------------------------------------------------------------
>>> > Comprehensive Server Monitoring with Site24x7.
>>> > Monitor 10 servers for $9/Month.
>>> > Get alerted through email, SMS, voice calls or mobile push
>>> > notifications.
>>> > Take corrective actions from your mobile device.
>>> > http://p.sf.net/sfu/Zoho
>>> > _______________________________________________
>>> > Scikit-learn-general mailing list
>>> > Scikit-learn-general@lists.sourceforge.net
>>> > https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>
>>> ------------------------------
>>>
>>> Message: 4
>>> Date: Wed, 15 Oct 2014 08:01:27 +1100
>>> From: Joel Nothman <joel.noth...@gmail.com>
>>> Subject: Re: [Scikit-learn-general] Suggestion: break up the metrics module
>>> To: scikit-learn-general <scikit-learn-general@lists.sourceforge.net>
>>> Message-ID: <caakaflu0fyhnfagmu9dkhr8oppd_kerirux+ckbfm7vunrn...@mail.gmail.com>
>>> Content-Type: text/plain; charset="utf-8"
>>>
>>> We had a plan to move out the model selection stuff. Presently that
>>> talked about moving scorers, but not necessarily the metrics underlying
>>> them....
>>>
>>> On 15 October 2014 07:16, Lars Buitinck <larsm...@gmail.com> wrote:
>>>
>>> > 2014-10-14 21:53 GMT+02:00 Robert Layton <robertlay...@gmail.com>:
>>> > > Currently the word "metrics" is overloaded with at least two types of
>>> > > algorithms in that module. The first is evaluation metrics and the
>>> > > second is functions dealing with distance metrics.
>>> > >
>>> > > My suggestion is to:
>>> > > 1) Move the evaluation metrics to a new top level folder called
>>> > >    "evaluation"
>>> > > 2) Move the distance metrics to a new top level folder called
>>> > >    "distance"
>>> > > 3) Create pointers with deprecation warnings from the metrics folder
>>> > >    to the above two folders.
>>> > >
>>> > > This would be a big job -- lots of documentation to fix etc. So I
>>> > > wanted to get suggestions before I start.
>>> > >
>>> > > Thoughts?
>>> >
>>> > Didn't we already have a plan to move out the evaluation stuff?
>>> >
>>> > Btw., there are also similarity functions in the module. Putting those
>>> > in a "distance" module seems a bit strange, so I suggest we just keep
>>> > the name for at least the distance stuff. (I know "metric" is the
>>> > mathematician's term for distance, but "similarity metric" is common
>>> > enough, I think.)
>>>
>>> ------------------------------
>>>
>>> Message: 5
>>> Date: Tue, 14 Oct 2014 23:08:02 +0200
>>> From: Gael Varoquaux <gael.varoqu...@normalesup.org>
>>> Subject: Re: [Scikit-learn-general] Suggestion: break up the metrics module
>>> To: scikit-learn-general@lists.sourceforge.net
>>> Message-ID: <20141014210802.gc26...@phare.normalesup.org>
>>> Content-Type: text/plain; charset=iso-8859-1
>>>
>>> On Wed, Oct 15, 2014 at 06:53:35AM +1100, Robert Layton wrote:
>>> > Currently the word "metrics" is overloaded with at least two types of
>>> > algorithms in that module. The first is evaluation metrics and the
>>> > second is functions dealing with distance metrics.
>>>
>>> Please, let's just try as much as possible to avoid such changes.
>>>
>>> The goal of such a change is to make things prettier, or more logical,
>>> according to a certain logic. The benefit is that, to some people, it will
>>> make more sense. What's important to keep in mind is that most users
>>> don't understand the fine details of the acceptance of the names, and
>>> that none of the module names make a huge amount of sense. Documentation
>>> and Google searches are what really sort users out.
>>>
>>> By changing module names, or any kind of API, we are making these Google
>>> searches unreliable, so we are actually making it harder for the users.
>>>
>>> In addition, we are breaking people's code. Yes we have a deprecation
>>> cycle, but it's costly for everybody to follow our changes.
>>>
>>> Thus, for an API change (and that's an API change), there needs to be
>>> clear benefits, IMHO.
>>>
>>> Gaël
>>>
>>> ------------------------------
>>>
>>> Message: 6
>>> Date: Tue, 14 Oct 2014 17:22:03 -0400
>>> From: Olivier Grisel <olivier.gri...@ensta.org>
>>> Subject: Re: [Scikit-learn-general] Access data arriving at leaf nodes
>>> To: scikit-learn-general <scikit-learn-general@lists.sourceforge.net>
>>> Message-ID: <CAFvE7K6A3UpC=nuMQiKKCmFZntp+pe6+4xqpnUq=_a15buk...@mail.gmail.com>
>>> Content-Type: text/plain; charset=UTF-8
>>>
>>> 2014-10-14 15:20 GMT-04:00 M Asad <masad....@gmail.com>:
>>> > Hi,
>>> >
>>> > I am kind of new to scikit, however I have learned a lot of things
>>> > now.
>>> >
>>> > I am using sklearn.ensemble.RandomForestRegressor to train on some data
>>> > and predict using some input samples later.
>>> > What I am trying to do now is to access the actual values that make up
>>> > each leaf node.
>>> >
>>> > I have managed to get the index of each leaf node used for prediction
>>> > by using the apply() function.
>>> > And I can also access the prediction value by calling
>>> > forestReg.estimators_[i].tree_.value[j] where i is the tree index and j
>>> > is the index of the leaf node.
>>> >
>>> > Does anyone have any idea how I can get all the values that make up a
>>> > leaf node? I have set min_samples_leaf = 5 so each leaf node comprises
>>> > at least 5 samples.
>>>
>>> I am not exactly sure about what you are trying to do but maybe having
>>> a look at the source code of the `predict` method of the trees will
>>> help:
>>>
>>> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L2417
>>>
>>> --
>>> Olivier
>>>
>>> ------------------------------
>>>
>>> End of Scikit-learn-general Digest, Vol 57, Issue 18
>>> ****************************************************