Hi Jacob,

Let's consider one leaf node with three order transactions: one order is
good ($30), and the other two are fraud ($35 + $35 = $70 fraud in total).

The two class_weights are equal, {0: 1, 1: 1}, where class 0 labels good
orders and class 1 labels fraud. The two classes are imbalanced: say we
have $100 of fraud (from 10 orders) and $900 of good transactions (from
1000 orders). The sample_weight
<http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html>
is the normalized order amount. Assuming this superstore has total sales
of $1000, this leaf node has a value of [0.03, 0.07]; let's denote this
value by [good_usd_leaf, fraud_usd_leaf]. Relative to the whole
superstore, the fraud caught by this leaf is fraud_rate_store =
$70/$1000 = 7% of total sales, while within this leaf node only,
fraud_rate_leaf = $70/($30+$70) = 70%. If we make a rule to decline
orders based on the decision rules leading to this leaf node, we can
save $70, or 7% of this superstore's sales.

With a decision tree it is very easy to find decision rules leading to
good orders, since we have so many samples. However, we have no interest
in those good transactions. We want to find all leaf nodes with
fraud_usd_leaf >= 0.05 and fraud_rate_leaf >= 30%.

Any ideas what strategies can be applied to this fraud detection problem?

Best,
Rex

On Thu, Aug 20, 2015 at 2:09 AM, Jacob Schreiber <jmschreibe...@gmail.com> wrote:

> It sounds like you prefer false negatives over false positives (not
> catching bad activity, but rarely misclassifying good activity as bad
> activity). You can weight the different classes currently by setting the
> sample weight on good activity points to be higher than those of bad
> activity points. The classifier will automatically try to fit the good
> activity better.
>
> If I misinterpreted what you asked, let me know. I wasn't exactly sure
> what your first sentence meant.
>
> On Thu, Aug 20, 2015 at 10:53 AM, Rex X <dnsr...@gmail.com> wrote:
>
>> Very nice! Thanks to both of you, Jacob and Andreas!
>>
>> Andreas, yes, I'm interested in all leaves.
>> The additional Pandas query done on each leaf node is a further check
>> to inspect whether this leaf node can be of interest or not.
>>
>> Take binary classification for example, fraud detection to be more
>> specific: we can set up sample_weight based on the order amount in the
>> first run, to find those large-sale segments. But we don't want a rule
>> that declines too many good orders in the meantime. That means we want
>> to minimize both the number and the amount of good samples in each leaf
>> node. This does not seem possible in the current DecisionTree
>> implementation. That's why we need to fire a further Pandas query to do
>> this job.
>>
>> Any comments?
>>
>> Best,
>> Rex
>>
>>
>> On Tue, Aug 18, 2015 at 9:23 AM, Jacob Schreiber <jmschreibe...@gmail.com> wrote:
>>
>>> There is no code to do it automatically, but you can use the following
>>> to get the array of thresholds:
>>>
>>> ```
>>> clf = DecisionTreeClassifier()
>>> clf.fit(datums, targets)
>>> clf.tree_.threshold
>>> ```
>>>
>>> The full list of attributes you can call are (feature, threshold,
>>> impurity, n_node_samples, weighted_n_node_samples, value (the
>>> prediction), children_left, children_right).
>>>
>>> Does this help?
>>>
>>> On Tue, Aug 18, 2015 at 7:53 AM, Andreas Mueller <t3k...@gmail.com> wrote:
>>>
>>>> I'm not aware of any ready-made code. But you can just get the boolean
>>>> matrix by using ``apply`` and a one-hot encoder.
>>>> Why are you interested in a single leaf? The query seems to be able to
>>>> return "only" a single boolean.
>>>> It is probably more efficient to traverse the full tree for each data
>>>> point if you are interested in all the leaves.
>>>>
>>>> On 08/18/2015 11:39 AM, Rex X wrote:
>>>>
>>>> Hi everyone,
>>>>
>>>> Is it possible to extract the decision tree rule associated with each
>>>> leaf node into a Pandas DataFrame query
>>>> <http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html>?
>>>> So that one can view the corresponding DataFrame content by feeding
>>>> in the decision rule.
>>>>
>>>> Best,
>>>> Rex
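The leaf filter Rex describes (fraud_usd_leaf >= 0.05 and fraud_rate_leaf >= 30%) can be done with ``apply``, as Andreas suggests: map each order to its leaf, aggregate the normalized dollar weights per leaf and class, then filter. A minimal sketch on synthetic data (the toy orders, amounts, and labels below are made up for illustration; they are not from this thread):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the superstore orders: y = 1 marks fraud, 0 good,
# and `amounts` are order amounts in dollars (all illustrative).
rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = (X[:, 0] + 0.3 * rng.rand(200) > 1.0).astype(int)
amounts = rng.uniform(5.0, 50.0, size=200)
weights = amounts / amounts.sum()          # normalized order amount

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X, y, sample_weight=weights)

# Map every order to its leaf, then aggregate dollar weight per leaf/class.
leaf_ids = clf.apply(X)
good_usd, fraud_usd = {}, {}
for leaf, label, w in zip(leaf_ids, y, weights):
    bucket = fraud_usd if label == 1 else good_usd
    bucket[leaf] = bucket.get(leaf, 0.0) + w

# Keep leaves matching the criteria fraud_usd_leaf >= 0.05 and
# fraud_rate_leaf >= 30%.
flagged = []
for leaf in sorted(set(leaf_ids)):
    g = good_usd.get(leaf, 0.0)
    f = fraud_usd.get(leaf, 0.0)
    if f >= 0.05 and f / (g + f) >= 0.30:
        flagged.append(leaf)
print(flagged)
```

Aggregating from ``apply`` rather than reading ``tree_.value`` directly avoids depending on whether ``value`` stores weighted counts or normalized fractions, which has differed across scikit-learn versions.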
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
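For the thread's original question, turning each leaf's decision path into a pandas DataFrame.query string, the attributes Jacob lists (``feature``, ``threshold``, ``children_left``, ``children_right``) are enough. A sketch, where the helper name ``leaf_queries`` and the iris example are illustrative rather than from the thread:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
df = pd.DataFrame(iris.data, columns=cols)

clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(df.values, iris.target)
tree = clf.tree_

def leaf_queries(node=0, conds=()):
    """Yield (leaf_id, query_string) for every leaf below `node`.

    scikit-learn sends samples with feature <= threshold to the left
    child; children_left == -1 marks a leaf.
    """
    if tree.children_left[node] == -1:
        yield node, " and ".join(conds)
        return
    name = cols[tree.feature[node]]
    thr = float(tree.threshold[node])   # repr() round-trips exactly
    yield from leaf_queries(tree.children_left[node],
                            conds + (f"{name} <= {thr!r}",))
    yield from leaf_queries(tree.children_right[node],
                            conds + (f"{name} > {thr!r}",))

queries = dict(leaf_queries())
for leaf, q in queries.items():
    print(leaf, "->", q)
```

Feeding any of these strings to ``df.query(...)`` recovers exactly the rows that ``clf.apply`` routes to that leaf, so each flagged leaf can be inspected as a DataFrame. Column names must be valid Python identifiers for ``DataFrame.query``, and a tree consisting of a single root leaf would yield an empty query string.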