Re: [Scikit-learn-general] Access data arriving at leaf nodes

Joel Nothman Tue, 14 Oct 2014 17:26:35 -0700

If what you need is "the samples that ended up at each leaf node during
training", is this not something like:


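from collections import defaultdict

# Assumes indices = forestClf.apply(trainingData), i.e. an array of leaf
# indices with shape (n_samples, n_estimators).
samples_by_node = defaultdict(list)
for est_ind, est_data in enumerate(indices.T):
    for sample_ind, leaf in enumerate(est_data):
        samples_by_node[est_ind, leaf].append(sample_ind)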

?
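
(A fuller sketch of the same grouping idea, in case it is useful. X_train, y_train and forest are made-up names on synthetic data, not the poster's. bootstrap=False is a deliberate assumption so that every tree is grown on all training samples; with the default bootstrap=True, apply() on the whole training set also routes samples that a tree never drew, so the group sizes can differ from tree_.n_node_samples.)

```python
from collections import defaultdict

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the training data (hypothetical, illustration only).
rng = np.random.RandomState(0)
X_train = rng.uniform(size=(200, 4))
y_train = X_train[:, 0] + 0.1 * rng.normal(size=200)

# bootstrap=False: each tree is fitted on every training sample (assumption).
forest = RandomForestRegressor(n_estimators=3, min_samples_leaf=5,
                               bootstrap=False, max_features=2,
                               random_state=0)
forest.fit(X_train, y_train)

# indices[s, t] is the leaf of tree t that training sample s falls into.
indices = forest.apply(X_train)

# Group training sample indices by (tree index, leaf index).
samples_by_node = defaultdict(list)
for est_ind, est_data in enumerate(indices.T):
    for sample_ind, leaf in enumerate(est_data):
        samples_by_node[est_ind, leaf].append(sample_ind)

# Training targets that reached each (tree, leaf).
targets_by_node = {key: y_train[idx] for key, idx in samples_by_node.items()}
```

targets_by_node[t, leaf] then holds the training targets at that leaf, which is the kind of input a per-leaf density estimate would need.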

On 15 October 2014 09:59, M Asad <[email protected]> wrote:

> I am not sure if there is already a method to get this, but I have read the
> docs and there doesn't seem to be one. Please correct me if I am wrong.
>
> Actually, I am trying to get the probability distribution at each leaf node,
> as done in the book "Decision Forests for Computer Vision and Medical Image
> Analysis", for which I need the samples that ended up at each leaf node
> during training. Then I will use kernel density estimation to get a
> continuous probability distribution at each leaf node. I have done this in
> my own implementation in C++/OpenCV; however, when using scikit-learn, all I
> need are those particular samples at the leaf node.
>
> For prediction, I have used apply() to get the index of the predicted leaf.
> forestReg.estimators_[i].tree_.value[j] returns only one prediction value;
> however, if I call forestReg.estimators_[i].tree_.n_node_samples[j], the
> number of samples I get is larger than min_samples_leaf (which I have set
> to 5 at the moment).
> Here j is the index of a leaf node within the tree with index i.
>
> If it helps here is the code I am using:
>
> import numpy as np
> from sklearn.ensemble import RandomForestRegressor
>
> # read the training data
> trainingLabels = readMatFromFile('dataSet//trainingLabelsSim.dat').T
> trainingData = readMatFromFile('dataSet//trainingDataSim.dat').T
>
> # read the testing data
> testingLabels = readMatFromFile('dataSet//testingLabelsSim.dat').T
> testingData = readMatFromFile('dataSet//testingDataSim.dat').T
>
> forestClf = RandomForestRegressor(n_estimators=100, min_samples_leaf=5,
>                                   random_state=0, max_depth=20,
>                                   max_features=10, verbose=1)
>
> forestClf.fit(trainingData, trainingLabels)
>
> # index[j, i] is the leaf reached by test sample j in tree i
> index = forestClf.apply(testingData)
> leafVals = np.zeros(index.shape)
> for j in range(index.shape[0]):
>     for i in range(index.shape[1]):
>         leafVals[j, i] = forestClf.estimators_[i].tree_.value[index[j, i]]
>
>
>
> Many thanks in advance
> Muhammad
>
> Date: Wed, 15 Oct 2014 07:59:09 +1100
>> From: Joel Nothman <[email protected]>
>> Subject: Re: [Scikit-learn-general] Access data arriving at leaf nodes
>> To: scikit-learn-general <[email protected]>
>> Message-ID:
>>         <CAAkaFLUB_ApLWGosUovxfEoEi34bcw-ePke0TBCKF3NrQpF=
>> [email protected]>
>> Content-Type: text/plain; charset="utf-8"
>>
>> What do you mean by all the values that make up a leaf node? If you mean
>> all the samples, isn't apply sufficient?
>>
>> On 15 October 2014 06:20, M Asad <[email protected]> wrote:
>>
>> > Hi,
>> >
>> > I am kind of new to scikit-learn, but I have learned a lot of things by now.
>> >
>> > I am using sklearn.ensemble.RandomForestRegressor to train on some data and
>> > predict using some input samples later.
>> > What I am trying to do now is to access the actual values that make up
>> > each leaf node.
>> >
>> > I have managed to get the index of each leaf node used for prediction by
>> > using the apply() function.
>> > I can also access the prediction value by calling
>> > forestReg.estimators_[i].tree_.value[j], where i is the tree index and j
>> > is the index of the leaf node.
>> >
>> > Does anyone have any idea how I can get all the values that make up a
>> > leaf node? I have set min_samples_leaf = 5, so each leaf node comprises
>> > at least 5 samples.
>> >
>> > Many thanks!
>> >
>> > Best regards,
>> > Muhammad Asad
>> >
>> >
>> >
>> ------------------------------------------------------------------------------
>> > Comprehensive Server Monitoring with Site24x7.
>> > Monitor 10 servers for $9/Month.
>> > Get alerted through email, SMS, voice calls or mobile push
>> notifications.
>> > Take corrective actions from your mobile device.
>> > http://p.sf.net/sfu/Zoho
>> > _______________________________________________
>> > Scikit-learn-general mailing list
>> > [email protected]
>> > https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>> >
>> >
>>
>> ------------------------------
>>
>> Message: 4
>> Date: Wed, 15 Oct 2014 08:01:27 +1100
>> From: Joel Nothman <[email protected]>
>> Subject: Re: [Scikit-learn-general] Suggestion: break up the metrics
>>         module
>> To: scikit-learn-general <[email protected]>
>> Message-ID:
>>         <
>> caakaflu0fyhnfagmu9dkhr8oppd_kerirux+ckbfm7vunrn...@mail.gmail.com>
>> Content-Type: text/plain; charset="utf-8"
>>
>> We had a plan to move out the model selection stuff. Presently that talked
>> about moving scorers, but not necessarily the metrics underlying them....
>>
>> On 15 October 2014 07:16, Lars Buitinck <[email protected]> wrote:
>>
>> > 2014-10-14 21:53 GMT+02:00 Robert Layton <[email protected]>:
>> > > Currently the word "metrics" is overloaded with at least two types of
>> > > algorithms in that module. The first is evaluation metrics and the
>> > > second is functions dealing with distance metrics.
>> > >
>> > > My suggestion is to:
>> > > 1) Move the evaluation metrics to a new top level folder called
>> > >    "evaluation"
>> > > 2) Move the distance metrics to a new top level folder called
>> > >    "distance"
>> > > 3) Create pointers with deprecation warnings from the metrics folder
>> > >    to the above two folders.
>> > >
>> > > This would be a big job -- lots of documentation to fix etc. So I
>> > > wanted to get suggestions before I start.
>> > >
>> > > Thoughts?
>> >
>> > Didn't we already have a plan to move out the evaluation stuff?
>> >
>> > Btw., there are also similarity functions in the module. Putting those
>> > in a "distance" module seems a bit strange, so I suggest we just keep
>> > the name for at least the distance stuff. (I know "metric" is the
>> > mathematician's term for distance, but "similarity metric" is common
>> > enough, I think.)
>> >
>> >
>> >
>> >
>>
>> ------------------------------
>>
>> Message: 5
>> Date: Tue, 14 Oct 2014 23:08:02 +0200
>> From: Gael Varoquaux <[email protected]>
>> Subject: Re: [Scikit-learn-general] Suggestion: break up the metrics
>>         module
>> To: [email protected]
>> Message-ID: <[email protected]>
>> Content-Type: text/plain; charset=iso-8859-1
>>
>> On Wed, Oct 15, 2014 at 06:53:35AM +1100, Robert Layton wrote:
>> > Currently the word "metrics" is overloaded with at least two types of
>> > algorithms in that module. The first is evaluation metrics and the
>> > second is functions dealing with distance metrics.
>>
>> Please, let's just try as much as possible to avoid such changes.
>>
>> The goal of such a change is to make things prettier, or more logical,
>> according to a certain logic. The benefit is that, to some, it will
>> make more sense. What's important to keep in mind is that most users
>> don't understand the fine details of the meaning of the names, and
>> that none of the module names make a huge amount of sense. Documentation
>> and Google searches are what really sort users out.
>>
>> By changing module names, or any kind of API, we are making these Google
>> searches unreliable, so we are actually making it harder for the users.
>>
>> In addition, we are breaking people's code. Yes, we have a deprecation
>> cycle, but it's costly for everybody to follow our changes.
>>
>> Thus, for an API change (and that's an API change), there needs to be
>> clear benefits, IMHO.
>>
>> Gaël
>>
>>
>>
>>
>>
>> ------------------------------
>>
>> Message: 6
>> Date: Tue, 14 Oct 2014 17:22:03 -0400
>> From: Olivier Grisel <[email protected]>
>> Subject: Re: [Scikit-learn-general] Access data arriving at leaf nodes
>> To: scikit-learn-general <[email protected]>
>> Message-ID:
>>         <CAFvE7K6A3UpC=nuMQiKKCmFZntp+pe6+4xqpnUq=_
>> [email protected]>
>> Content-Type: text/plain; charset=UTF-8
>>
>>
>> 2014-10-14 15:20 GMT-04:00 M Asad <[email protected]>:
>> > Hi,
>> >
>> > I am kind of new to scikit-learn, but I have learned a lot of things by now.
>> >
>> > I am using sklearn.ensemble.RandomForestRegressor to train on some data and
>> > predict using some input samples later.
>> > What I am trying to do now is to access the actual values that make up
>> > each leaf node.
>> >
>> > I have managed to get the index of each leaf node used for prediction by
>> > using the apply() function.
>> > I can also access the prediction value by calling
>> > forestReg.estimators_[i].tree_.value[j], where i is the tree index and j
>> > is the index of the leaf node.
>> >
>> > Does anyone have any idea how I can get all the values that make up a
>> > leaf node? I have set min_samples_leaf = 5, so each leaf node comprises
>> > at least 5 samples.
>>
>> I am not exactly sure about what you are trying to do but maybe having
>> a look at the source code of the `predict` method of the trees will
>> help:
>>
>>
>> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L2417
>>
>> --
>> Olivier
>>
>>
>>
>> ------------------------------
>>
>>
>>
>>
>> End of Scikit-learn-general Digest, Vol 57, Issue 18
>> ****************************************************
>>
>
>
>
>
>

