Hi Rex,

I would set up the problem in the same way.
Look at http://scikit-learn.org/stable/modules/tree.html. The visualization should be of use to you: you can manually inspect good_usd_leaf and fraud_usd_leaf on the plot. If you want to do this automatically, you should look at clf.tree_.value, which is the array of these values, one row per node. You can reshape it into an n x 2 array, where each row is [good_usd_leaf, fraud_usd_leaf]. If you use two copies, one for the raw values and one for the percentages, you can easily filter by your two rules. I'm not testing this code, so it may have some minor mistakes, but it should look something like the following:

```
import numpy as np
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
# weights are the normalized order amounts you described
clf.fit(datums, targets, sample_weight=weights)

# tree_.value has shape (n_nodes, n_outputs, n_classes);
# flatten it to (n_nodes, 2) for this single-output binary problem
n = clf.tree_.node_count
raw = clf.tree_.value.reshape(n, 2)
norm = raw / raw.sum(axis=1, keepdims=True)

# node ids with fraud_usd_leaf >= 0.05 and fraud_rate_leaf >= 30%
print(np.arange(n)[(raw[:, 1] >= 0.05) & (norm[:, 1] >= 0.30)])
```
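One caveat: tree_.value has a row for every node, internal nodes included, so the filter above picks out internal nodes as well as leaves. If you only want the leaves, and you also want the decision rule that leads to each one (so you can feed it back into a pandas DataFrame.query, as in your original question), something like the sketch below should work. It is equally untested; it walks the tree through tree_.feature, tree_.threshold, tree_.children_left, and tree_.children_right (a children_left value of -1 marks a leaf), and feature_names is a hypothetical list of your DataFrame's column names:

```
def leaf_rules(clf, feature_names):
    # Map each leaf node id to a pandas DataFrame.query() string built
    # from the splits on the path from the root down to that leaf.
    tree = clf.tree_
    rules = {}

    def recurse(node, conditions):
        if tree.children_left[node] == -1:  # -1 marks a leaf
            rules[node] = " and ".join(conditions) if conditions else "True"
            return
        name = feature_names[tree.feature[node]]
        threshold = tree.threshold[node]
        # samples go left when feature <= threshold, right otherwise
        recurse(tree.children_left[node],
                conditions + ["%s <= %f" % (name, threshold)])
        recurse(tree.children_right[node],
                conditions + ["%s > %f" % (name, threshold)])

    recurse(0, [])
    return rules
```

For any leaf id picked out by the filter above, df.query(leaf_rules(clf, list(df.columns))[leaf_id]) should then give you the matching rows, assuming df is the DataFrame with the same columns the tree was trained on.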
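And to connect this with Andreas's ``apply`` suggestion quoted below: a minimal, untested sketch of the boolean matrix he describes, again assuming a fitted clf and the training matrix datums (clf.apply returns the leaf id each sample lands in; on older scikit-learn versions the method lives on clf.tree_ instead):

```
import numpy as np

# leaf id that each training sample ends up in
leaf_ids = clf.apply(datums)

# boolean membership matrix: entry (i, j) is True when sample i
# falls in the j-th leaf of np.unique(leaf_ids)
leaves = np.unique(leaf_ids)
membership = leaf_ids[:, None] == leaves[None, :]
```

Column j of membership then selects exactly the rows that fall into leaf leaves[j].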
I hope that makes sense! Let me know if you have any other questions.

On Thu, Aug 27, 2015 at 11:12 AM, Rex X <dnsr...@gmail.com> wrote:

> Hi Jacob,
>
> Let's consider one leaf node with three order transactions: one order is
> good ($30), and the other two are fraud ($35 + $35 = $70 fraud in total).
>
> The two class_weights are equal, {'0': 1, '1': 1}, in which class '0'
> labels a good order and class '1' labels a fraud. The two classes are
> imbalanced: say we have $100 of fraud (from 10 orders) and $900 of good
> transactions (from 1000 orders).
>
> The sample_weight
> <http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html>
> is the normalized order amount. Assume this superstore has total sales
> of $1000. That means this leaf node has a value of [0.03, 0.07]; let's
> denote this value by [good_usd_leaf, fraud_usd_leaf].
>
> The fraud rate is fraud_rate_store = 7% for the whole superstore
> ($70/$1000 = 7%), and fraud_rate_leaf = 70% in the context of this leaf
> node only ($70/($30 + $70) = 70%).
>
> If we make a rule to decline orders based on the decision rules leading
> to this leaf node, we can save $70, or 7% of this superstore's sales.
>
> With a decision tree, it is very easy to find decision rules leading to
> good orders, since we have so many samples. However, we have no interest
> in those good transactions.
>
> We want to find all leaf nodes with fraud_usd_leaf >= 0.05 and
> fraud_rate_leaf >= 30%.
>
> Any ideas what strategies can be applied to this fraud detection problem?
>
> Best,
> Rex
>
> On Thu, Aug 20, 2015 at 2:09 AM, Jacob Schreiber <jmschreibe...@gmail.com>
> wrote:
>
>> It sounds like you prefer false negatives over false positives (not
>> catching bad activity, but rarely misclassifying good activity as bad
>> activity). You can currently weight the different classes by setting the
>> sample weight on good-activity points higher than on bad-activity points.
>> The classifier will automatically try to fit the good activity better.
>>
>> If I misinterpreted what you asked, let me know. I wasn't exactly sure
>> what your first sentence meant.
>>
>> On Thu, Aug 20, 2015 at 10:53 AM, Rex X <dnsr...@gmail.com> wrote:
>>
>>> Very nice! Thanks to both of you, Jacob and Andreas!
>>>
>>> Andreas, yes, I'm interested in all leaves. The additional Pandas query
>>> done on each leaf node is a further check to inspect whether that leaf
>>> node is of interest or not.
>>>
>>> Take binary classification, fraud detection to be more specific: we can
>>> set up sample_weight based on the order amount in a first run to find
>>> those large-sale segments. But we don't want a rule that declines too
>>> many good orders in the meantime. That means we want to minimize both
>>> the number and the amount of good samples in each leaf node. This job
>>> seemingly cannot be done in the current DecisionTree implementation;
>>> that's why we need to fire a further Pandas query to do this job.
>>>
>>> Any comments?
>>>
>>> Best,
>>> Rex
>>>
>>> On Tue, Aug 18, 2015 at 9:23 AM, Jacob Schreiber <jmschreibe...@gmail.com>
>>> wrote:
>>>
>>>> There is no code to do it automatically, but you can use the following
>>>> to get the array of thresholds:
>>>>
>>>> ```
>>>> clf = DecisionTreeClassifier()
>>>> clf.fit(datums, targets)
>>>> clf.tree_.threshold
>>>> ```
>>>>
>>>> The full list of attributes you can use is (feature, threshold,
>>>> impurity, n_node_samples, weighted_n_node_samples, value (the
>>>> prediction), children_left, children_right).
>>>>
>>>> Does this help?
>>>>
>>>> On Tue, Aug 18, 2015 at 7:53 AM, Andreas Mueller <t3k...@gmail.com>
>>>> wrote:
>>>>
>>>>> I'm not aware of any ready-made code. But you can just get the boolean
>>>>> matrix by using ``apply`` and a one-hot encoder.
>>>>> Why are you interested in a single leaf? The query seems to be able
>>>>> to return "only" a single boolean.
>>>>> It is probably more efficient to traverse the full tree for each data
>>>>> point if you are interested in all the leaves.
>>>>>
>>>>> On 08/18/2015 11:39 AM, Rex X wrote:
>>>>>
>>>>> Hi everyone,
>>>>>
>>>>> Is it possible to extract the decision tree rule associated with each
>>>>> leaf node into a Pandas Dataframe query
>>>>> <http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html>,
>>>>> so that one can view the corresponding Dataframe content by feeding in
>>>>> the decision rule?
>>>>>
>>>>> Best,
>>>>> Rex
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general