Decision trees use Gini impurity by default. You can switch to entropy as follows:

```
clf = DecisionTreeClassifier(criterion='entropy')
```

You can impose those constraints in the following way:

```
clf = DecisionTreeClassifier(min_weight_fraction_leaf=0.05,
                             class_weight={0: 0.30, 1: 0.70})
```

This requires every leaf to hold at least 5% (0.05) of the total sample weight, and reweights the classes so that 30% of a leaf's weight being fraud exactly balances the remaining 70% being good; anything above 30% (even 30.00001%) makes the leaf classify as fraud. Again, I haven't tested the code, but conceptually it should work.
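As a quick sanity check of that 30% tipping point, here is a sketch with the same caveat that it is untested, and the toy data below is made up:

```
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: one constant feature, so the tree cannot split and
# stays a single leaf containing 31% fraud (label 1).
X = np.zeros((100, 1))
y = np.array([1] * 31 + [0] * 69)

clf = DecisionTreeClassifier(class_weight={0: 0.30, 1: 0.70})
clf.fit(X, y)

# Weighted votes: 0.70 * 31 = 21.7 fraud vs 0.30 * 69 = 20.7 good,
# so 31% fraud weight is already enough to tip the leaf to fraud.
print(clf.predict([[0.0]]))  # expected: [1]
```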
On Fri, Aug 28, 2015 at 2:54 AM, Rex X <dnsr...@gmail.com> wrote:

> Brian,
>
> It is great to have query rules from a decision tree.
>
> Back to the original question: is there any native way to make the decision tree split following the "gini or entropy" criterion while also satisfying the two fraud-detection conditions, i.e. producing leaf nodes with *fraud_usd_leaf* >= 0.05 and *fraud_rate_leaf* >= 30%, from the beginning of the split?
>
> The strategy Jacob and I discussed can work for a small data set, but it becomes hard work as the data set and the number of categorical features grow.
>
> Any comments will be greatly welcome!
>
> Best,
> Rex
>
> On Thu, Aug 27, 2015 at 4:06 PM, Brian Scannell <brianjscann...@gmail.com> wrote:
>
>> Rex,
>>
>> For extracting decision rules as a Pandas query, here is some sample code with a test case that should work. No promises though.
>>
>> ```
>> import pandas as pd
>> from sklearn import tree
>> from sklearn.datasets import load_iris
>>
>> def get_queries(clf, feature_names):
>>     def recurse(node_id, rules):
>>         left = clf.tree_.children_left[node_id]
>>         right = clf.tree_.children_right[node_id]
>>
>>         # Check if this is a decision node
>>         if left != tree._tree.TREE_LEAF:
>>             feature = feature_names[clf.tree_.feature[node_id]]
>>             new_rule = feature + " {0} " + "%.4f" % clf.tree_.threshold[node_id]
>>             return (recurse(left, rules + [new_rule.format('<=')]) +
>>                     recurse(right, rules + [new_rule.format('>')]))
>>         else:  # Leaf
>>             return [" and ".join(rules)]
>>
>>     return recurse(node_id=0, rules=[])
>>
>> iris = load_iris()
>> clf = tree.DecisionTreeClassifier()
>> clf.fit(iris.data, iris.target)
>>
>> # pandas queries must have valid python identifiers for column names
>> feature_names = [name.replace(' ', '_').replace('(', '').replace(')', '')
>>                  for name in iris.feature_names]
>> leaf_queries = get_queries(clf, feature_names)
>>
>> iris_df = pd.DataFrame(iris.data, columns=feature_names)
>>
>> print("Total queried samples: {0}".format(
>>     sum(len(iris_df.query(query)) for query in leaf_queries)))
>> ```
>>
>> On Thu, Aug 27, 2015 at 6:15 PM Rex X <dnsr...@gmail.com> wrote:
>>
>>> Hi Jacob,
>>>
>>> That is cool! Very helpful.
>>>
>>> Further, building on your idea, I can run a loop with random splits and automatically find the leaf nodes satisfying the two fraud-detection conditions.
>>>
>>> This raises one question: how do I extract the decision rules associated with a selected leaf node?
>>>
>>> Usually the decision rules need to be merged from the split rules along the path from the root node all the way down to the leaf. It is hard work to collect each of them manually.
>>>
>>> Best,
>>> Rex
>>>
>>> On Thu, Aug 27, 2015 at 12:32 PM, Jacob Schreiber <jmschreibe...@gmail.com> wrote:
>>>
>>>> Hi Rex,
>>>>
>>>> I would set up the problem in the same way.
>>>>
>>>> Look at http://scikit-learn.org/stable/modules/tree.html. The visualization should be of use to you; you can manually inspect good_usd_leaf and fraud_usd_leaf there.
>>>>
>>>> If you want to do this automatically, look at clf.tree_.value, which holds the per-node class totals. You can reshape it into an n x 2 array, where each row is [good_usd_leaf, fraud_usd_leaf]. If you keep two copies, one with the raw values and one normalized to percentages, you can easily filter by your two rules.
>>>>
>>>> I'm not testing this code, so it probably has some minor mistakes, but it should look ~something~ like the following:
>>>>
>>>> ```
>>>> import numpy as np
>>>> from sklearn.tree import DecisionTreeClassifier
>>>>
>>>> clf = DecisionTreeClassifier()
>>>> clf.fit(datums, targets)
>>>>
>>>> n = clf.tree_.node_count
>>>> raw = clf.tree_.value.reshape(n, 2)
>>>> norm = raw / raw.sum(axis=1, keepdims=True)
>>>>
>>>> print(np.arange(n)[(raw[:, 1] >= 0.05) & (norm[:, 1] >= 0.30)])
>>>> ```
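>>>> One refinement, with the same untested caveat: clf.tree_.value covers internal nodes as well as leaves, so you may want to mask down to the leaves first. A minimal sketch, continuing from the variables above:
>>>>
>>>> ```
>>>> from sklearn.tree import _tree
>>>>
>>>> # Leaves are the nodes with no children (children_left == TREE_LEAF)
>>>> leaf_mask = clf.tree_.children_left == _tree.TREE_LEAF
>>>> print(np.arange(n)[leaf_mask & (raw[:, 1] >= 0.05) & (norm[:, 1] >= 0.30)])
>>>> ```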
>>>> I hope that makes sense! Let me know if you have any other questions.
>>>>
>>>> On Thu, Aug 27, 2015 at 11:12 AM, Rex X <dnsr...@gmail.com> wrote:
>>>>
>>>>> Hi Jacob,
>>>>>
>>>>> Let's consider one leaf node with three order transactions: one order is good ($30), and the other two are fraud ($35 + $35 = $70 fraud in total).
>>>>>
>>>>> The two class_weights are equal, {0: 1, 1: 1}, in which class 0 labels good orders and class 1 labels fraud. The two classes are imbalanced: say we have $100 of fraud (from 10 orders) and $900 of good transactions (from 1000 orders).
>>>>>
>>>>> The sample_weight <http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html> is the normalized order amount. Assuming this superstore has total sales of $1000, this leaf node has a value of [0.03, 0.07]; let's denote this value by [good_usd_leaf, *fraud_usd_leaf*].
>>>>>
>>>>> The fraud rate of this leaf node is fraud_rate_store = $70/$1000 = 7% relative to the whole superstore, and *fraud_rate_leaf* = $70/($30+$70) = 70% in the context of this leaf node only.
>>>>>
>>>>> If we make a rule to decline orders based on the decision rules leading to this leaf node, we can save $70, or 7% of this superstore's sales.
>>>>>
>>>>> With a decision tree it is very easy to find decision rules leading to good orders, since we have so many good samples. However, we have no interest in those good transactions.
>>>>>
>>>>> We want to find all leaf nodes with *fraud_usd_leaf* >= 0.05 and *fraud_rate_leaf* >= 30%.
>>>>>
>>>>> Any ideas what strategies can be applied to this fraud detection problem?
>>>>>
>>>>> Best,
>>>>> Rex
>>>>>
>>>>> On Thu, Aug 20, 2015 at 2:09 AM, Jacob Schreiber <jmschreibe...@gmail.com> wrote:
>>>>>
>>>>>> It sounds like you prefer false negatives over false positives (not catching some bad activity, but rarely misclassifying good activity as bad). You can currently weight the two classes by setting the sample weight on good-activity points higher than on bad-activity points; the classifier will then automatically try harder to fit the good activity.
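>>>>>> For example, a sketch (untested; X and y are your features and 0/1 labels):
>>>>>>
>>>>>> ```
>>>>>> import numpy as np
>>>>>> from sklearn.tree import DecisionTreeClassifier
>>>>>>
>>>>>> # Give good-activity points (label 0) twice the weight of bad ones
>>>>>> # (label 1); the 2.0 ratio is purely illustrative.
>>>>>> weights = np.where(y == 0, 2.0, 1.0)
>>>>>>
>>>>>> clf = DecisionTreeClassifier()
>>>>>> clf.fit(X, y, sample_weight=weights)
>>>>>> ```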
>>>>>> If I misinterpreted what you asked, let me know. I wasn't exactly sure what your first sentence meant.
>>>>>>
>>>>>> On Thu, Aug 20, 2015 at 10:53 AM, Rex X <dnsr...@gmail.com> wrote:
>>>>>>
>>>>>>> Very nice! Thanks to both of you, Jacob and Andreas!
>>>>>>>
>>>>>>> Andreas, yes, I'm interested in all leafs. The additional Pandas query done on each leaf node is a further check to inspect whether that leaf node is of interest or not.
>>>>>>>
>>>>>>> For binary classification, fraud detection to be more specific, we can set up sample_weight based on the order amount in the first run, to find those large-sale segments. But we don't want a rule that declines too many good orders in the meantime. That means we want to minimize both the number and the amount of good samples in each leaf node. This job seemingly cannot be done in the current DecisionTree implementation. That's why we need to fire a further Pandas query to do it.
>>>>>>>
>>>>>>> Any comments?
>>>>>>>
>>>>>>> Best,
>>>>>>> Rex
>>>>>>>
>>>>>>> On Tue, Aug 18, 2015 at 9:23 AM, Jacob Schreiber <jmschreibe...@gmail.com> wrote:
>>>>>>>
>>>>>>>> There is no code to do it automatically, but you can use the following to get the array of thresholds:
>>>>>>>>
>>>>>>>> ```
>>>>>>>> clf = DecisionTreeClassifier()
>>>>>>>> clf.fit(datums, targets)
>>>>>>>> clf.tree_.threshold
>>>>>>>> ```
>>>>>>>>
>>>>>>>> The full list of attributes you can access is (feature, threshold, impurity, n_node_samples, weighted_n_node_samples, value (the prediction), children_left, children_right).
>>>>>>>>
>>>>>>>> Does this help?
>>>>>>>>
>>>>>>>> On Tue, Aug 18, 2015 at 7:53 AM, Andreas Mueller <t3k...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I'm not aware of any ready-made code. But you can just get the boolean matrix by using ``apply`` and a one-hot encoder.
>>>>>>>>> Why are you interested in a single leaf? The query seems to be able to return "only" a single boolean.
>>>>>>>>> It is probably more efficient to traverse the full tree for each data point if you are interested in all the leafs.
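>>>>>>>>> A minimal sketch of that (untested; clf is your fitted tree and X your feature matrix):
>>>>>>>>>
>>>>>>>>> ```
>>>>>>>>> from sklearn.preprocessing import OneHotEncoder
>>>>>>>>>
>>>>>>>>> # apply() returns the id of the leaf each sample lands in; one-hot
>>>>>>>>> # encoding turns that into a boolean samples-by-leaves matrix.
>>>>>>>>> leaf_ids = clf.apply(X).reshape(-1, 1)
>>>>>>>>> membership = OneHotEncoder().fit_transform(leaf_ids)
>>>>>>>>> ```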
>>>>>>>>> On 08/18/2015 11:39 AM, Rex X wrote:
>>>>>>>>>
>>>>>>>>> Hi everyone,
>>>>>>>>>
>>>>>>>>> Is it possible to extract the decision tree rule associated with each leaf node into a Pandas DataFrame query <http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html>, so that one can view the corresponding DataFrame content by feeding in the decision rule?
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Rex