Brian, that is great. Being able to turn a decision tree into query rules is very useful.

Back to the original question: is there any native way to make the decision tree split following the "gini" or "entropy" criterion while also satisfying the two fraud-detection conditions (leaf nodes with *fraud_usd_leaf* >= 0.05 and *fraud_rate_leaf* >= 30%) from the beginning of the splitting? The strategy Jacob and I discussed can work for a small data set, but it becomes hard work as the data set and the number of categorical features grow. Roughly, that strategy looks like the sketch below.
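(An untested sketch of the brute-force loop; `X`, `y`, and `sample_weight` are placeholders for the real order data. Because sample_weight is the normalized order amount, as in my example further down the thread, clf.tree_.value holds the weighted dollar fractions per class.)

```
import numpy as np
from sklearn.tree import DecisionTreeClassifier

for seed in range(100):  # brute-force restarts over random splits
    clf = DecisionTreeClassifier(splitter='random', random_state=seed)
    clf.fit(X, y, sample_weight=sample_weight)

    # weighted [good_usd, fraud_usd] totals for every node
    value = clf.tree_.value.reshape(-1, 2)
    rate = value / value.sum(axis=1).reshape(-1, 1)

    # leaves satisfying both fraud-detection conditions
    is_leaf = clf.tree_.children_left == -1
    hits = is_leaf & (value[:, 1] >= 0.05) & (rate[:, 1] >= 0.30)
    if hits.any():
        print(seed, np.flatnonzero(hits))  # node ids of qualifying leaves
```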
Any comments would be greatly welcome!

Best,
Rex

On Thu, Aug 27, 2015 at 4:06 PM, Brian Scannell <brianjscann...@gmail.com> wrote:

> Rex,
>
> For extracting decision rules as a Pandas query, here is some sample code
> with a test case that should work. No promises though.
>
> ```
> import pandas as pd
> from sklearn.datasets import load_iris
> from sklearn import tree
>
> def get_queries(clf, feature_names):
>     def recurse(node_id, rules):
>         left = clf.tree_.children_left[node_id]
>         right = clf.tree_.children_right[node_id]
>
>         # Check if this is a decision node
>         if left != tree._tree.TREE_LEAF:
>             feature = feature_names[clf.tree_.feature[node_id]]
>             new_rule = feature + " {0} " + "%.4f" % clf.tree_.threshold[node_id]
>             return (recurse(left, rules + [new_rule.format('<=')]) +
>                     recurse(right, rules + [new_rule.format('>')]))
>         else:  # Leaf
>             return [" and ".join(rules)]
>
>     return recurse(node_id=0, rules=[])
>
> iris = load_iris()
> clf = tree.DecisionTreeClassifier()
> clf.fit(iris.data, iris.target)
>
> # pandas queries must have valid python identifiers for column names
> feature_names = [name.replace(' ', '_').replace('(', '').replace(')', '')
>                  for name in iris.feature_names]
> leaf_queries = get_queries(clf, feature_names)
>
> iris_df = pd.DataFrame(iris.data, columns=feature_names)
>
> print("Total queried samples: {0}".format(
>     sum([len(iris_df.query(query)) for query in leaf_queries])))
> ```
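> Since every row lands in exactly one leaf, the printed total should match
> the number of rows (150 for iris), up to edge cases from rounding the
> thresholds to four decimals. A hypothetical usage example for inspecting a
> single leaf:
>
> ```
> # the first leaf's rule, and the rows that satisfy it
> print(leaf_queries[0])
> print(iris_df.query(leaf_queries[0]).head())
> ```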
> On Thu, Aug 27, 2015 at 6:15 PM Rex X <dnsr...@gmail.com> wrote:
>
>> Hi Jacob,
>>
>> That is cool! Very helpful.
>>
>> Furthermore, based on your idea, I can run a loop with random splits and
>> automatically find the leaf nodes satisfying the two fraud-detection
>> conditions.
>>
>> This raises one question: how to extract the decision rules associated
>> with one selected leaf node?
>>
>> Usually the decision rules need to be merged from the split rules along
>> the path from the root node all the way down to this leaf node. It is
>> hard work to collect each of them manually.
>>
>> Best,
>> Rex
>>
>> On Thu, Aug 27, 2015 at 12:32 PM, Jacob Schreiber <jmschreibe...@gmail.com> wrote:
>>
>>> Hi Rex
>>>
>>> I would set up the problem in the same way.
>>>
>>> Look at http://scikit-learn.org/stable/modules/tree.html. The
>>> visualization should be of use to you, where you can manually inspect
>>> good_usd_leaf and fraud_usd_leaf.
>>>
>>> If you want to do this automatically, you should look at
>>> clf.tree_.value, which holds these values for every node. You can
>>> reshape it into an n x 2 array, where each row is [good_usd_leaf,
>>> fraud_usd_leaf]. If you use two copies, one for the raw values and one
>>> for the percentages, you can easily filter by your two rules.
>>>
>>> I'm not testing this code, so it probably has some minor mistakes, but
>>> it should look ~something~ like the following:
>>>
>>> ```
>>> import numpy as np
>>> from sklearn.tree import DecisionTreeClassifier
>>>
>>> clf = DecisionTreeClassifier()
>>> clf.fit(datums, targets)
>>>
>>> n = clf.tree_.node_count
>>> raw = clf.tree_.value.reshape(n, 2)
>>> norm = raw / raw.sum(axis=1).reshape(n, 1)
>>>
>>> print(np.arange(n)[(raw[:, 1] >= 0.05) & (norm[:, 1] >= 0.30)])
>>> ```
>>>
>>> I hope that makes sense!
>>> Let me know if you have any other questions
>>>
>>> On Thu, Aug 27, 2015 at 11:12 AM, Rex X <dnsr...@gmail.com> wrote:
>>>
>>>> Hi Jacob,
>>>>
>>>> Let's consider one leaf node with three order transactions: one order
>>>> is good ($30), and the other two are fraud ($35 + $35 = $70 fraud in
>>>> total).
>>>>
>>>> The two class_weights are equal, {'0': 1, '1': 1}, in which class '0'
>>>> labels a good order and class '1' labels a fraud. The two classes are
>>>> imbalanced: say we have $100 of fraud (from 10 orders) and $900 of good
>>>> transactions (from 1000 orders).
>>>>
>>>> The sample_weight
>>>> <http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html>
>>>> is the normalized order amount. Assume this superstore has total sales
>>>> of $1000. That means this leaf node has a value of [0.03, 0.07]; let's
>>>> denote this value by [good_usd_leaf, *fraud_usd_leaf*].
>>>>
>>>> The fraud rate is
>>>>     fraud_rate_store = $70/$1000 = 7%
>>>> relative to the whole superstore, and
>>>>     *fraud_rate_leaf* = $70/($30 + $70) = 70%
>>>> in the context of this leaf node only.
>>>>
>>>> If we make a rule to decline orders based on the decision rules leading
>>>> to this leaf node, we save $70, or 7% of this superstore's sales.
>>>>
>>>> With a decision tree it is very easy to find decision rules leading to
>>>> good orders, since we have so many samples. However, we have no
>>>> interest in those good transactions.
>>>>
>>>> We want to find all leaf nodes with *fraud_usd_leaf* >= 0.05 and
>>>> *fraud_rate_leaf* >= 30%.
>>>>
>>>> Any ideas what strategies can be applied to this fraud detection
>>>> problem?
>>>>
>>>> Best,
>>>> Rex
>>>>
>>>> On Thu, Aug 20, 2015 at 2:09 AM, Jacob Schreiber <jmschreibe...@gmail.com> wrote:
>>>>
>>>>> It sounds like you prefer false negatives over false positives (not
>>>>> catching some bad activity, but rarely misclassifying good activity
>>>>> as bad). You can currently weight the different classes by setting
>>>>> the sample weight on good activity points to be higher than that on
>>>>> bad activity points. The classifier will automatically try to fit the
>>>>> good activity better.
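>>>>> An untested sketch (`X` and `y` are placeholders here, with y == 0
>>>>> marking good activity and y == 1 marking bad activity):
>>>>>
>>>>> ```
>>>>> import numpy as np
>>>>> from sklearn.tree import DecisionTreeClassifier
>>>>>
>>>>> weights = np.where(y == 0, 10.0, 1.0)  # up-weight good activity
>>>>> clf = DecisionTreeClassifier()
>>>>> clf.fit(X, y, sample_weight=weights)
>>>>> ```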
>>>>>
>>>>> If I misinterpreted what you asked, let me know. I wasn't exactly
>>>>> sure what your first sentence meant.
>>>>>
>>>>> On Thu, Aug 20, 2015 at 10:53 AM, Rex X <dnsr...@gmail.com> wrote:
>>>>>
>>>>>> Very nice! Thanks to both of you, Jacob and Andreas!
>>>>>>
>>>>>> Andreas, yes, I'm interested in all leaves. The additional Pandas
>>>>>> query done on each leaf node is a further check to inspect whether
>>>>>> that leaf node is of interest or not.
>>>>>>
>>>>>> Take binary classification, fraud detection to be specific: we can
>>>>>> set up sample_weight based on the order amount in the first run, to
>>>>>> find those large-sale segments. But we don't want a rule that
>>>>>> declines too many good orders in the meantime. That means we want to
>>>>>> minimize both the number and the amount of good samples in each leaf
>>>>>> node. This job seemingly cannot be done in the current DecisionTree
>>>>>> implementation. That's why we need to run a further Pandas query to
>>>>>> do this job.
>>>>>>
>>>>>> Any comments?
>>>>>>
>>>>>> Best,
>>>>>> Rex
>>>>>>
>>>>>> On Tue, Aug 18, 2015 at 9:23 AM, Jacob Schreiber <jmschreibe...@gmail.com> wrote:
>>>>>>
>>>>>>> There is no code to do it automatically, but you can use the
>>>>>>> following to get the array of thresholds:
>>>>>>>
>>>>>>> ```
>>>>>>> clf = DecisionTreeClassifier()
>>>>>>> clf.fit(datums, targets)
>>>>>>> clf.tree_.threshold
>>>>>>> ```
>>>>>>>
>>>>>>> The full list of attributes you can access is: feature, threshold,
>>>>>>> impurity, n_node_samples, weighted_n_node_samples, value (the
>>>>>>> prediction), children_left, and children_right.
>>>>>>>
>>>>>>> Does this help?
>>>>>>>
>>>>>>> On Tue, Aug 18, 2015 at 7:53 AM, Andreas Mueller <t3k...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I'm not aware of any ready-made code. But you can just get the
>>>>>>>> boolean matrix by using ``apply`` and a one-hot encoder.
>>>>>>>> Why are you interested in a single leaf? The query seems to be
>>>>>>>> able to return "only" a single boolean.
>>>>>>>> It is probably more efficient to traverse the full tree for each
>>>>>>>> data point if you are interested in all the leaves.
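>>>>>>>> An untested sketch of the ``apply`` plus one-hot idea, assuming a
>>>>>>>> scikit-learn version where estimators expose ``apply`` (`clf` and
>>>>>>>> `X` are placeholders for a fitted tree and the data matrix):
>>>>>>>>
>>>>>>>> ```
>>>>>>>> import numpy as np
>>>>>>>>
>>>>>>>> leaf_ids = clf.apply(X)  # leaf index for every sample
>>>>>>>> # one-hot by hand: membership[i, j] is True iff sample i lands
>>>>>>>> # in the j-th distinct leaf
>>>>>>>> leaves = np.unique(leaf_ids)
>>>>>>>> membership = leaf_ids[:, np.newaxis] == leaves[np.newaxis, :]
>>>>>>>> ```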
>>>>>>>>
>>>>>>>> On 08/18/2015 11:39 AM, Rex X wrote:
>>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> Is it possible to extract the decision tree rule associated with
>>>>>>>> each leaf node into a Pandas DataFrame query
>>>>>>>> <http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html>,
>>>>>>>> so that one can view the corresponding DataFrame content by
>>>>>>>> feeding in the decision rule?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Rex
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general