Rex,

For extracting the decision rules as Pandas queries, here is some sample code with a test case that should work. No promises, though.
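First, a quick orientation on the `clf.tree_` arrays the snippet below traverses (`children_left`, `children_right`, `feature`, `threshold`). This sketch is not from the original mail; `max_depth=2` is chosen only to keep the printout short:

```python
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
clf = tree.DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

t = clf.tree_
for i in range(t.node_count):
    # leaves have children_left == children_right == tree._tree.TREE_LEAF (-1)
    if t.children_left[i] == tree._tree.TREE_LEAF:
        print("node %d: leaf, value %s" % (i, t.value[i].ravel()))
    else:
        # split nodes: "feature <= threshold" sends a sample left, else right
        print("node %d: %s <= %.4f ? -> node %d : node %d" % (
            i, iris.feature_names[t.feature[i]], t.threshold[i],
            t.children_left[i], t.children_right[i]))
```

The recursion in the snippet below walks exactly these parallel arrays, accumulating one "feature <= threshold" / "feature > threshold" clause per split on the way down.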
```
import pandas as pd
from sklearn import tree
from sklearn.datasets import load_iris


def get_queries(clf, feature_names):
    def recurse(node_id, rules):
        left = clf.tree_.children_left[node_id]
        right = clf.tree_.children_right[node_id]
        # Check if this is a decision node
        if left != tree._tree.TREE_LEAF:
            feature = feature_names[clf.tree_.feature[node_id]]
            new_rule = feature + " {0} " + "%.4f" % clf.tree_.threshold[node_id]
            return (recurse(left, rules + [new_rule.format('<=')]) +
                    recurse(right, rules + [new_rule.format('>')]))
        else:  # Leaf
            return [" and ".join(rules)]

    return recurse(node_id=0, rules=[])


iris = load_iris()
clf = tree.DecisionTreeClassifier()
clf.fit(iris.data, iris.target)

# pandas queries must have valid python identifiers for column names
feature_names = [name.replace(' ', '_').replace('(', '').replace(')', '')
                 for name in iris.feature_names]

leaf_queries = get_queries(clf, feature_names)
iris_df = pd.DataFrame(iris.data, columns=feature_names)
print("Total queried samples: {0}".format(
    sum(len(iris_df.query(query)) for query in leaf_queries)))
```

On Thu, Aug 27, 2015 at 6:15 PM Rex X <dnsr...@gmail.com> wrote:

> Hi Jacob,
>
> That is cool! Very helpful.
>
> Going further, based on your idea, I can do a loop with random splits and
> automatically find the leaf nodes satisfying the two fraud-detection
> conditions.
>
> This raises one question: how do I extract the decision rules associated
> with a selected leaf node?
>
> Usually the decision rules need to be merged from the split rules along
> the path from the root node all the way down to the leaf. Selecting each
> of them manually is hard work.
>
> Best,
> Rex
>
> On Thu, Aug 27, 2015 at 12:32 PM, Jacob Schreiber <jmschreibe...@gmail.com> wrote:
>
>> Hi Rex,
>>
>> I would set up the problem in the same way.
>>
>> Look at http://scikit-learn.org/stable/modules/tree.html.
>> The visualization should be of use to you; you can manually inspect
>> good_usd_leaf and fraud_usd_leaf.
>>
>> If you want to do this automatically, look at clf.tree_.value, which
>> holds these values for every node. You can reshape that array to
>> (n_nodes, 2), where each row is [good_usd_leaf, fraud_usd_leaf]. If you
>> use two copies, one for the raw values and one for the percentages, you
>> can easily filter by your two rules.
>>
>> I'm not testing this code, so it probably has some minor mistakes, but
>> it should look ~something~ like the following:
>>
>> ```
>> clf = DecisionTreeClassifier()
>> clf.fit(datums, targets)
>>
>> n = clf.tree_.node_count
>> raw = clf.tree_.value.reshape(n, 2)
>> norm = raw / raw.sum(axis=1, keepdims=True)
>>
>> print(np.arange(n)[(raw[:, 1] >= 0.05) & (norm[:, 1] >= 0.30)])
>> ```
>>
>> I hope that makes sense!
>> Let me know if you have any other questions.
>>
>> On Thu, Aug 27, 2015 at 11:12 AM, Rex X <dnsr...@gmail.com> wrote:
>>
>>> Hi Jacob,
>>>
>>> Let's consider one leaf node with three order transactions: one order
>>> is good ($30), and the other two are fraud ($35 + $35 = $70 fraud in
>>> total).
>>>
>>> The two class_weights are equal, {'0': 1, '1': 1}, in which class '0'
>>> labels a good order and class '1' labels a fraud. The two classes are
>>> imbalanced; say we have $100 fraud (from 10 orders) and $900 of good
>>> transactions (from 1000 orders).
>>>
>>> The sample_weight
>>> <http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html>
>>> is the normalized order amount. Assume this superstore has total sales
>>> of $1000. That means this leaf node has a value of [0.03, 0.07]; let's
>>> denote this value by [good_usd_leaf, *fraud_usd_leaf*].
>>>
>>> The fraud rate of this leaf node is
>>> fraud_rate_store = 7%
>>> for the whole superstore, $70/$1000 = 7%, and
>>> *fraud_rate_leaf* = 70%
>>> in the context of this leaf node only, $70/($30+$70) = 70%.
>>>
>>> If we make a rule to decline orders based on the decision rules leading
>>> to this leaf node, we can save $70, or 7% of this superstore's sales.
>>>
>>> With a decision tree, it is very easy to find decision rules leading to
>>> good orders, since we have so many samples. However, we have no
>>> interest in those good transactions.
>>>
>>> We want to find all leaf nodes with *fraud_usd_leaf* >= 0.05 and
>>> *fraud_rate_leaf* >= 30%.
>>>
>>> Any ideas what strategies can be applied to this fraud-detection
>>> problem?
>>>
>>> Best,
>>> Rex
>>>
>>> On Thu, Aug 20, 2015 at 2:09 AM, Jacob Schreiber <jmschreibe...@gmail.com> wrote:
>>>
>>>> It sounds like you prefer false negatives over false positives (not
>>>> catching bad activity, but rarely misclassifying good activity as
>>>> bad). You can weight the classes by setting the sample weight on
>>>> good-activity points higher than on bad-activity points. The
>>>> classifier will then automatically try to fit the good activity
>>>> better.
>>>>
>>>> If I misinterpreted what you asked, let me know. I wasn't exactly sure
>>>> what your first sentence meant.
>>>>
>>>> On Thu, Aug 20, 2015 at 10:53 AM, Rex X <dnsr...@gmail.com> wrote:
>>>>
>>>>> Very nice! Thanks to both of you, Jacob and Andreas!
>>>>>
>>>>> Andreas, yes, I'm interested in all leaves. The additional Pandas
>>>>> query on each leaf node is a further check to inspect whether that
>>>>> leaf node is of interest or not.
>>>>>
>>>>> Take binary classification, fraud detection to be more specific: we
>>>>> can set up sample_weight based on the order amount in the first run
>>>>> to find the large-sale segments. But we don't want a rule that
>>>>> declines too many good orders in the meantime. That means we want to
>>>>> minimize both the number and the amount of good samples in each leaf
>>>>> node.
>>>>> This job seems like it cannot be done in the current DecisionTree
>>>>> implementation. That's why we need to fire a further Pandas query to
>>>>> do this job.
>>>>>
>>>>> Any comments?
>>>>>
>>>>> Best,
>>>>> Rex
>>>>>
>>>>> On Tue, Aug 18, 2015 at 9:23 AM, Jacob Schreiber <jmschreibe...@gmail.com> wrote:
>>>>>
>>>>>> There is no code to do it automatically, but you can use the
>>>>>> following to get the array of thresholds:
>>>>>>
>>>>>> ```
>>>>>> clf = DecisionTreeClassifier()
>>>>>> clf.fit(datums, targets)
>>>>>> clf.tree_.threshold
>>>>>> ```
>>>>>>
>>>>>> The full list of attributes you can access is: feature, threshold,
>>>>>> impurity, n_node_samples, weighted_n_node_samples, value (the
>>>>>> prediction), children_left, and children_right.
>>>>>>
>>>>>> Does this help?
>>>>>>
>>>>>> On Tue, Aug 18, 2015 at 7:53 AM, Andreas Mueller <t3k...@gmail.com> wrote:
>>>>>>
>>>>>>> I'm not aware of any ready-made code. But you can just get the
>>>>>>> boolean matrix by using ``apply`` and a one-hot encoder.
>>>>>>> Why are you interested in a single leaf? The query seems to be able
>>>>>>> to return "only" a single boolean.
>>>>>>> It is probably more efficient to traverse the full tree for each
>>>>>>> data point if you are interested in all the leaves.
>>>>>>>
>>>>>>> On 08/18/2015 11:39 AM, Rex X wrote:
>>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> Is it possible to extract the decision-tree rule associated with
>>>>>>> each leaf node into a Pandas DataFrame query
>>>>>>> <http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html>?
>>>>>>> That way one could view the corresponding DataFrame content by
>>>>>>> feeding in the decision rule.
>>>>>>>
>>>>>>> Best,
>>>>>>> Rex
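Tying together two ideas from the quoted thread above — Jacob's `clf.tree_.value` filter and Andreas's ``apply`` suggestion — here is an end-to-end sketch on synthetic data. The data, thresholds, and variable names are all illustrative, not from the thread; note that `tree_.value` is an attribute (not a method) of shape `(n_nodes, 1, n_classes)`, weighted by `sample_weight` (very recent scikit-learn versions may store it per-node normalized):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(500, 4)
y = (X[:, 0] > 0.7).astype(int)        # pretend class 1 is "fraud"
amounts = rng.rand(500)
amounts /= amounts.sum()               # normalized order amounts

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X, y, sample_weight=amounts)

n = clf.tree_.node_count
raw = clf.tree_.value.reshape(n, 2)    # per-node [good_usd, fraud_usd]
norm = raw / raw.sum(axis=1, keepdims=True)

# leaves satisfying both rules: fraud_usd_leaf >= 0.05, fraud_rate_leaf >= 30%
is_leaf = clf.tree_.children_left == -1
flagged = np.arange(n)[is_leaf & (raw[:, 1] >= 0.05) & (norm[:, 1] >= 0.30)]

# Andreas's ``apply`` trick: leaf id per sample, so the rows falling into
# any flagged leaf can be pulled out without re-deriving the rules
leaf_ids = clf.apply(X)
suspicious = np.isin(leaf_ids, flagged)
print("flagged leaves:", flagged, "- suspicious samples:", suspicious.sum())
```

The leaf ids from `apply` are the same node ids the `get_queries` recursion visits, so either route (boolean membership matrix or Pandas query strings) should select the same rows.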
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general