Rex,

For extracting the decision rules as Pandas queries, here is some sample code with a test case that should work. No promises, though.
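First, a quick orientation on the `clf.tree_` arrays the snippet below traverses (`children_left`, `children_right`, `feature`, `threshold`). This sketch is not from the original mail; `max_depth=2` is chosen only to keep the printout short:

```python
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
clf = tree.DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

t = clf.tree_
for i in range(t.node_count):
    # leaves have children_left == children_right == tree._tree.TREE_LEAF (-1)
    if t.children_left[i] == tree._tree.TREE_LEAF:
        print("node %d: leaf, value %s" % (i, t.value[i].ravel()))
    else:
        # split nodes: "feature <= threshold" sends a sample left, else right
        print("node %d: %s <= %.4f ? -> node %d : node %d" % (
            i, iris.feature_names[t.feature[i]], t.threshold[i],
            t.children_left[i], t.children_right[i]))
```

The recursion in the snippet below walks exactly these parallel arrays, accumulating one "feature <= threshold" / "feature > threshold" clause per split on the way down.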
```
import pandas as pd
from sklearn import tree
from sklearn.datasets import load_iris


def get_queries(clf, feature_names):
    def recurse(node_id, rules):
        left = clf.tree_.children_left[node_id]
        right = clf.tree_.children_right[node_id]
        # Check if this is a decision node
        if left != tree._tree.TREE_LEAF:
            feature = feature_names[clf.tree_.feature[node_id]]
            new_rule = feature + " {0} " + "%.4f" % clf.tree_.threshold[node_id]
            return (recurse(left, rules + [new_rule.format('<=')]) +
                    recurse(right, rules + [new_rule.format('>')]))
        else:  # Leaf
            return [" and ".join(rules)]

    return recurse(node_id=0, rules=[])


iris = load_iris()
clf = tree.DecisionTreeClassifier()
clf.fit(iris.data, iris.target)

# pandas queries must have valid python identifiers for column names
feature_names = [name.replace(' ', '_').replace('(', '').replace(')', '')
                 for name in iris.feature_names]

leaf_queries = get_queries(clf, feature_names)
iris_df = pd.DataFrame(iris.data, columns=feature_names)
print("Total queried samples: {0}".format(
    sum(len(iris_df.query(query)) for query in leaf_queries)))
```

On Thu, Aug 27, 2015 at 6:15 PM Rex X <dnsr...@gmail.com> wrote:

> Hi Jacob,
>
> That is cool! Very helpful.
>
> Going further, based on your idea, I can do a loop with random splits and
> automatically find the leaf nodes satisfying the two fraud-detection
> conditions.
>
> This raises one question: how do I extract the decision rules associated
> with a selected leaf node?
>
> Usually the decision rules need to be merged from the split rules along
> the path from the root node all the way down to the leaf. Selecting each
> of them manually is hard work.
>
> Best,
> Rex
>
> On Thu, Aug 27, 2015 at 12:32 PM, Jacob Schreiber <jmschreibe...@gmail.com> wrote:
>
>> Hi Rex,
>>
>> I would set up the problem in the same way.
>>
>> Look at http://scikit-learn.org/stable/modules/tree.html.
>> The visualization should be of use to you; you can manually inspect
>> good_usd_leaf and fraud_usd_leaf.
>>
>> If you want to do this automatically, look at clf.tree_.value, which
>> holds these values for every node. You can reshape that array to
>> (n_nodes, 2), where each row is [good_usd_leaf, fraud_usd_leaf]. If you
>> use two copies, one for the raw values and one for the percentages, you
>> can easily filter by your two rules.
>>
>> I'm not testing this code, so it probably has some minor mistakes, but
>> it should look ~something~ like the following:
>>
>> ```
>> clf = DecisionTreeClassifier()
>> clf.fit(datums, targets)
>>
>> n = clf.tree_.node_count
>> raw = clf.tree_.value.reshape(n, 2)
>> norm = raw / raw.sum(axis=1, keepdims=True)
>>
>> print(np.arange(n)[(raw[:, 1] >= 0.05) & (norm[:, 1] >= 0.30)])
>> ```
>>
>> I hope that makes sense!
>> Let me know if you have any other questions.
>>
>> On Thu, Aug 27, 2015 at 11:12 AM, Rex X <dnsr...@gmail.com> wrote:
>>
>>> Hi Jacob,
>>>
>>> Let's consider one leaf node with three order transactions: one order
>>> is good ($30), and the other two are fraud ($35 + $35 = $70 fraud in
>>> total).
>>>
>>> The two class_weights are equal, {'0': 1, '1': 1}, in which class '0'
>>> labels a good order and class '1' labels a fraud. The two classes are
>>> imbalanced; say we have $100 fraud (from 10 orders) and $900 of good
>>> transactions (from 1000 orders).
>>>
>>> The sample_weight
>>> <http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html>
>>> is the normalized order amount. Assume this superstore has total sales
>>> of $1000. That means this leaf node has a value of [0.03, 0.07]; let's
>>> denote this value by [good_usd_leaf, *fraud_usd_leaf*].
>>>
>>> The fraud rate of this leaf node is
>>> fraud_rate_store = 7%
>>> for the whole superstore, $70/$1000 = 7%, and
>>> *fraud_rate_leaf* = 70%
>>> in the context of this leaf node only, $70/($30+$70) = 70%.
>>>
>>> If we make a rule to decline orders based on the decision rules leading
>>> to this leaf node, we can save $70, or 7% of this superstore's sales.
>>>
>>> With a decision tree, it is very easy to find decision rules leading to
>>> good orders, since we have so many samples. However, we have no
>>> interest in those good transactions.
>>>
>>> We want to find all leaf nodes with *fraud_usd_leaf* >= 0.05 and
>>> *fraud_rate_leaf* >= 30%.
>>>
>>> Any ideas what strategies can be applied to this fraud-detection
>>> problem?
>>>
>>> Best,
>>> Rex
>>>
>>> On Thu, Aug 20, 2015 at 2:09 AM, Jacob Schreiber <jmschreibe...@gmail.com> wrote:
>>>
>>>> It sounds like you prefer false negatives over false positives (not
>>>> catching bad activity, but rarely misclassifying good activity as
>>>> bad). You can weight the classes by setting the sample weight on
>>>> good-activity points higher than on bad-activity points. The
>>>> classifier will then automatically try to fit the good activity
>>>> better.
>>>>
>>>> If I misinterpreted what you asked, let me know. I wasn't exactly sure
>>>> what your first sentence meant.
>>>>
>>>> On Thu, Aug 20, 2015 at 10:53 AM, Rex X <dnsr...@gmail.com> wrote:
>>>>
>>>>> Very nice! Thanks to both of you, Jacob and Andreas!
>>>>>
>>>>> Andreas, yes, I'm interested in all leaves. The additional Pandas
>>>>> query on each leaf node is a further check to inspect whether that
>>>>> leaf node is of interest or not.
>>>>>
>>>>> Take binary classification, fraud detection to be more specific: we
>>>>> can set up sample_weight based on the order amount in the first run
>>>>> to find the large-sale segments. But we don't want a rule that
>>>>> declines too many good orders in the meantime. That means we want to
>>>>> minimize both the number and the amount of good samples in each leaf
>>>>> node.
>>>>> This job seems like it cannot be done in the current DecisionTree
>>>>> implementation. That's why we need to fire a further Pandas query to
>>>>> do this job.
>>>>>
>>>>> Any comments?
>>>>>
>>>>> Best,
>>>>> Rex
>>>>>
>>>>> On Tue, Aug 18, 2015 at 9:23 AM, Jacob Schreiber <jmschreibe...@gmail.com> wrote:
>>>>>
>>>>>> There is no code to do it automatically, but you can use the
>>>>>> following to get the array of thresholds:
>>>>>>
>>>>>> ```
>>>>>> clf = DecisionTreeClassifier()
>>>>>> clf.fit(datums, targets)
>>>>>> clf.tree_.threshold
>>>>>> ```
>>>>>>
>>>>>> The full list of attributes you can access is: feature, threshold,
>>>>>> impurity, n_node_samples, weighted_n_node_samples, value (the
>>>>>> prediction), children_left, and children_right.
>>>>>>
>>>>>> Does this help?
>>>>>>
>>>>>> On Tue, Aug 18, 2015 at 7:53 AM, Andreas Mueller <t3k...@gmail.com> wrote:
>>>>>>
>>>>>>> I'm not aware of any ready-made code. But you can just get the
>>>>>>> boolean matrix by using ``apply`` and a one-hot encoder.
>>>>>>> Why are you interested in a single leaf? The query seems to be able
>>>>>>> to return "only" a single boolean.
>>>>>>> It is probably more efficient to traverse the full tree for each
>>>>>>> data point if you are interested in all the leaves.
>>>>>>>
>>>>>>> On 08/18/2015 11:39 AM, Rex X wrote:
>>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> Is it possible to extract the decision-tree rule associated with
>>>>>>> each leaf node into a Pandas DataFrame query
>>>>>>> <http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html>?
>>>>>>> That way one could view the corresponding DataFrame content by
>>>>>>> feeding in the decision rule.
>>>>>>>
>>>>>>> Best,
>>>>>>> Rex
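Tying together two ideas from the quoted thread above — Jacob's `clf.tree_.value` filter and Andreas's ``apply`` suggestion — here is an end-to-end sketch on synthetic data. The data, thresholds, and variable names are all illustrative, not from the thread; note that `tree_.value` is an attribute (not a method) of shape `(n_nodes, 1, n_classes)`, weighted by `sample_weight` (very recent scikit-learn versions may store it per-node normalized):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(500, 4)
y = (X[:, 0] > 0.7).astype(int)        # pretend class 1 is "fraud"
amounts = rng.rand(500)
amounts /= amounts.sum()               # normalized order amounts

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X, y, sample_weight=amounts)

n = clf.tree_.node_count
raw = clf.tree_.value.reshape(n, 2)    # per-node [good_usd, fraud_usd]
norm = raw / raw.sum(axis=1, keepdims=True)

# leaves satisfying both rules: fraud_usd_leaf >= 0.05, fraud_rate_leaf >= 30%
is_leaf = clf.tree_.children_left == -1
flagged = np.arange(n)[is_leaf & (raw[:, 1] >= 0.05) & (norm[:, 1] >= 0.30)]

# Andreas's ``apply`` trick: leaf id per sample, so the rows falling into
# any flagged leaf can be pulled out without re-deriving the rules
leaf_ids = clf.apply(X)
suspicious = np.isin(leaf_ids, flagged)
print("flagged leaves:", flagged, "- suspicious samples:", suspicious.sum())
```

The leaf ids from `apply` are the same node ids the `get_queries` recursion visits, so either route (boolean membership matrix or Pandas query strings) should select the same rows.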
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general