Decision trees use Gini impurity by default. You can switch to entropy as follows:

```
clf = DecisionTreeClassifier(criterion='entropy')
```

You can impose those constraints in the following way:

```
clf = DecisionTreeClassifier(min_weight_fraction_leaf=0.05,
                             class_weight={0: 0.30, 1: 0.70})
```

This requires every leaf to hold at least 5% (0.05) of the total sample weight, and reweights the classes so that 30% of a leaf's weight being fraud exactly balances the remaining 70% being good; anything above 30% (even 30.00001%) makes the leaf classify as fraud. Again, I haven't tested the code, but conceptually it should work.
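As a quick sanity check of that 30% tipping point, here is a sketch with the same caveat that it is untested, and the toy data below is made up:

```
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: one constant feature, so the tree cannot split and
# stays a single leaf containing 31% fraud (label 1).
X = np.zeros((100, 1))
y = np.array([1] * 31 + [0] * 69)

clf = DecisionTreeClassifier(class_weight={0: 0.30, 1: 0.70})
clf.fit(X, y)

# Weighted votes: 0.70 * 31 = 21.7 fraud vs 0.30 * 69 = 20.7 good,
# so 31% fraud weight is already enough to tip the leaf to fraud.
print(clf.predict([[0.0]]))  # expected: [1]
```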
On Fri, Aug 28, 2015 at 2:54 AM, Rex X <dnsr...@gmail.com> wrote:

> Brian,
>
> It is great to have query rules from a decision tree.
>
> Back to the original question: is there any native way to make the decision tree split following the "gini or entropy" criterion while also satisfying the two fraud-detection conditions, i.e. producing leaf nodes with *fraud_usd_leaf* >= 0.05 and *fraud_rate_leaf* >= 30%, from the beginning of the split?
>
> The strategy Jacob and I discussed can work for a small data set, but it becomes hard work as the data set and the number of categorical features grow.
>
> Any comments will be greatly welcome!
>
> Best,
> Rex
>
> On Thu, Aug 27, 2015 at 4:06 PM, Brian Scannell <brianjscann...@gmail.com> wrote:
>
>> Rex,
>>
>> For extracting decision rules as a Pandas query, here is some sample code with a test case that should work. No promises though.
>>
>> ```
>> import pandas as pd
>> from sklearn import tree
>> from sklearn.datasets import load_iris
>>
>> def get_queries(clf, feature_names):
>>     def recurse(node_id, rules):
>>         left = clf.tree_.children_left[node_id]
>>         right = clf.tree_.children_right[node_id]
>>
>>         # Check if this is a decision node
>>         if left != tree._tree.TREE_LEAF:
>>             feature = feature_names[clf.tree_.feature[node_id]]
>>             new_rule = feature + " {0} " + "%.4f" % clf.tree_.threshold[node_id]
>>             return (recurse(left, rules + [new_rule.format('<=')]) +
>>                     recurse(right, rules + [new_rule.format('>')]))
>>         else:  # Leaf
>>             return [" and ".join(rules)]
>>
>>     return recurse(node_id=0, rules=[])
>>
>> iris = load_iris()
>> clf = tree.DecisionTreeClassifier()
>> clf.fit(iris.data, iris.target)
>>
>> # pandas queries must have valid python identifiers for column names
>> feature_names = [name.replace(' ', '_').replace('(', '').replace(')', '')
>>                  for name in iris.feature_names]
>> leaf_queries = get_queries(clf, feature_names)
>>
>> iris_df = pd.DataFrame(iris.data, columns=feature_names)
>>
>> print("Total queried samples: {0}".format(
>>     sum(len(iris_df.query(query)) for query in leaf_queries)))
>> ```
>>
>> On Thu, Aug 27, 2015 at 6:15 PM Rex X <dnsr...@gmail.com> wrote:
>>
>>> Hi Jacob,
>>>
>>> That is cool! Very helpful.
>>>
>>> Further, building on your idea, I can run a loop with random splits and automatically find the leaf nodes satisfying the two fraud-detection conditions.
>>>
>>> This raises one question: how do I extract the decision rules associated with a selected leaf node?
>>>
>>> Usually the decision rules need to be merged from the split rules along the path from the root node all the way down to the leaf. It is hard work to collect each of them manually.
>>>
>>> Best,
>>> Rex
>>>
>>> On Thu, Aug 27, 2015 at 12:32 PM, Jacob Schreiber <jmschreibe...@gmail.com> wrote:
>>>
>>>> Hi Rex,
>>>>
>>>> I would set up the problem in the same way.
>>>>
>>>> Look at http://scikit-learn.org/stable/modules/tree.html. The visualization should be of use to you; you can manually inspect good_usd_leaf and fraud_usd_leaf there.
>>>>
>>>> If you want to do this automatically, look at clf.tree_.value, which holds the per-node class totals. You can reshape it into an n x 2 array, where each row is [good_usd_leaf, fraud_usd_leaf]. If you keep two copies, one with the raw values and one normalized to percentages, you can easily filter by your two rules.
>>>>
>>>> I'm not testing this code, so it probably has some minor mistakes, but it should look ~something~ like the following:
>>>>
>>>> ```
>>>> import numpy as np
>>>> from sklearn.tree import DecisionTreeClassifier
>>>>
>>>> clf = DecisionTreeClassifier()
>>>> clf.fit(datums, targets)
>>>>
>>>> n = clf.tree_.node_count
>>>> raw = clf.tree_.value.reshape(n, 2)
>>>> norm = raw / raw.sum(axis=1, keepdims=True)
>>>>
>>>> print(np.arange(n)[(raw[:, 1] >= 0.05) & (norm[:, 1] >= 0.30)])
>>>> ```
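>>>> One refinement, with the same untested caveat: clf.tree_.value covers internal nodes as well as leaves, so you may want to mask down to the leaves first. A minimal sketch, continuing from the variables above:
>>>>
>>>> ```
>>>> from sklearn.tree import _tree
>>>>
>>>> # Leaves are the nodes with no children (children_left == TREE_LEAF)
>>>> leaf_mask = clf.tree_.children_left == _tree.TREE_LEAF
>>>> print(np.arange(n)[leaf_mask & (raw[:, 1] >= 0.05) & (norm[:, 1] >= 0.30)])
>>>> ```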
>>>> I hope that makes sense! Let me know if you have any other questions.
>>>>
>>>> On Thu, Aug 27, 2015 at 11:12 AM, Rex X <dnsr...@gmail.com> wrote:
>>>>
>>>>> Hi Jacob,
>>>>>
>>>>> Let's consider one leaf node with three order transactions: one order is good ($30), and the other two are fraud ($35 + $35 = $70 fraud in total).
>>>>>
>>>>> The two class_weights are equal, {0: 1, 1: 1}, in which class 0 labels good orders and class 1 labels fraud. The two classes are imbalanced: say we have $100 of fraud (from 10 orders) and $900 of good transactions (from 1000 orders).
>>>>>
>>>>> The sample_weight <http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html> is the normalized order amount. Assuming this superstore has total sales of $1000, this leaf node has a value of [0.03, 0.07]; let's denote this value by [good_usd_leaf, *fraud_usd_leaf*].
>>>>>
>>>>> The fraud rate of this leaf node is fraud_rate_store = $70/$1000 = 7% relative to the whole superstore, and *fraud_rate_leaf* = $70/($30+$70) = 70% in the context of this leaf node only.
>>>>>
>>>>> If we make a rule to decline orders based on the decision rules leading to this leaf node, we can save $70, or 7% of this superstore's sales.
>>>>>
>>>>> With a decision tree it is very easy to find decision rules leading to good orders, since we have so many good samples. However, we have no interest in those good transactions.
>>>>>
>>>>> We want to find all leaf nodes with *fraud_usd_leaf* >= 0.05 and *fraud_rate_leaf* >= 30%.
>>>>>
>>>>> Any ideas what strategies can be applied to this fraud detection problem?
>>>>>
>>>>> Best,
>>>>> Rex
>>>>>
>>>>> On Thu, Aug 20, 2015 at 2:09 AM, Jacob Schreiber <jmschreibe...@gmail.com> wrote:
>>>>>
>>>>>> It sounds like you prefer false negatives over false positives (not catching some bad activity, but rarely misclassifying good activity as bad). You can currently weight the two classes by setting the sample weight on good-activity points higher than on bad-activity points; the classifier will then automatically try harder to fit the good activity.
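>>>>>> For example, a sketch (untested; X and y are your features and 0/1 labels):
>>>>>>
>>>>>> ```
>>>>>> import numpy as np
>>>>>> from sklearn.tree import DecisionTreeClassifier
>>>>>>
>>>>>> # Give good-activity points (label 0) twice the weight of bad ones
>>>>>> # (label 1); the 2.0 ratio is purely illustrative.
>>>>>> weights = np.where(y == 0, 2.0, 1.0)
>>>>>>
>>>>>> clf = DecisionTreeClassifier()
>>>>>> clf.fit(X, y, sample_weight=weights)
>>>>>> ```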
>>>>>> If I misinterpreted what you asked, let me know. I wasn't exactly sure what your first sentence meant.
>>>>>>
>>>>>> On Thu, Aug 20, 2015 at 10:53 AM, Rex X <dnsr...@gmail.com> wrote:
>>>>>>
>>>>>>> Very nice! Thanks to both of you, Jacob and Andreas!
>>>>>>>
>>>>>>> Andreas, yes, I'm interested in all leafs. The additional Pandas query done on each leaf node is a further check to inspect whether that leaf node is of interest or not.
>>>>>>>
>>>>>>> For binary classification, fraud detection to be more specific, we can set up sample_weight based on the order amount in the first run, to find those large-sale segments. But we don't want a rule that declines too many good orders in the meantime. That means we want to minimize both the number and the amount of good samples in each leaf node. This job seemingly cannot be done in the current DecisionTree implementation. That's why we need to fire a further Pandas query to do it.
>>>>>>>
>>>>>>> Any comments?
>>>>>>>
>>>>>>> Best,
>>>>>>> Rex
>>>>>>>
>>>>>>> On Tue, Aug 18, 2015 at 9:23 AM, Jacob Schreiber <jmschreibe...@gmail.com> wrote:
>>>>>>>
>>>>>>>> There is no code to do it automatically, but you can use the following to get the array of thresholds:
>>>>>>>>
>>>>>>>> ```
>>>>>>>> clf = DecisionTreeClassifier()
>>>>>>>> clf.fit(datums, targets)
>>>>>>>> clf.tree_.threshold
>>>>>>>> ```
>>>>>>>>
>>>>>>>> The full list of attributes you can access is (feature, threshold, impurity, n_node_samples, weighted_n_node_samples, value (the prediction), children_left, children_right).
>>>>>>>>
>>>>>>>> Does this help?
>>>>>>>>
>>>>>>>> On Tue, Aug 18, 2015 at 7:53 AM, Andreas Mueller <t3k...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I'm not aware of any ready-made code. But you can just get the boolean matrix by using ``apply`` and a one-hot encoder.
>>>>>>>>> Why are you interested in a single leaf? The query seems to be able to return "only" a single boolean.
>>>>>>>>> It is probably more efficient to traverse the full tree for each data point if you are interested in all the leafs.
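>>>>>>>>> A minimal sketch of that (untested; clf is your fitted tree and X your feature matrix):
>>>>>>>>>
>>>>>>>>> ```
>>>>>>>>> from sklearn.preprocessing import OneHotEncoder
>>>>>>>>>
>>>>>>>>> # apply() returns the id of the leaf each sample lands in; one-hot
>>>>>>>>> # encoding turns that into a boolean samples-by-leaves matrix.
>>>>>>>>> leaf_ids = clf.apply(X).reshape(-1, 1)
>>>>>>>>> membership = OneHotEncoder().fit_transform(leaf_ids)
>>>>>>>>> ```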
>>>>>>>>> On 08/18/2015 11:39 AM, Rex X wrote:
>>>>>>>>>
>>>>>>>>> Hi everyone,
>>>>>>>>>
>>>>>>>>> Is it possible to extract the decision tree rule associated with each leaf node into a Pandas DataFrame query <http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html>, so that one can view the corresponding DataFrame content by feeding in the decision rule?
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Rex