Meanwhile, among the decision rules found above that give fraud_rate = fraud_sales / total_sales >= 20%, we want to decline transactions using the rules that affect the fewest transactions, i.e. minimize the number of transactions declined.
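A minimal sketch of one way to do this with scikit-learn's fitted tree structure. The data, labels, and feature names below are synthetic stand-ins for the transaction features described in the thread, not the real data set; the traversal itself only uses documented `tree_` attributes (`children_left`, `children_right`, `feature`, `threshold`, `value`, `n_node_samples`):

```python
# Sketch: walk a fitted DecisionTreeClassifier, collect the rule leading to
# each leaf whose fraud rate meets a threshold, and rank the rules so those
# declining the fewest transactions come first.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
n = 10000
risk_score = rng.uniform(0, 100, n)
product_category = rng.randint(0, 10, n)
amount = rng.uniform(1, 500, n)
X = np.column_stack([risk_score, product_category, amount])
# Synthetic labels: fraud concentrated where risk_score is high.
y = ((risk_score > 90) & (rng.uniform(size=n) < 0.5)).astype(int)

clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
tree = clf.tree_
feature_names = ["risk_score", "product_category", "amount"]

def leaf_rules(min_fraud_rate=0.20):
    """Return (rule, n_transactions, fraud_rate) for each qualifying leaf,
    sorted so the rules affecting the fewest transactions come first."""
    rules, stack = [], [(0, [])]   # (node_id, list of condition strings)
    while stack:
        node, conds = stack.pop()
        if tree.children_left[node] == -1:       # -1 marks a leaf node
            good, fraud = tree.value[node][0]    # per-class totals at the leaf
            rate = fraud / (good + fraud)
            if rate >= min_fraud_rate:
                rules.append((" AND ".join(conds) or "(root)",
                              int(tree.n_node_samples[node]), float(rate)))
            continue
        name, thr = feature_names[tree.feature[node]], tree.threshold[node]
        stack.append((tree.children_left[node],  conds + [f"{name} <= {thr:.2f}"]))
        stack.append((tree.children_right[node], conds + [f"{name} > {thr:.2f}"]))
    return sorted(rules, key=lambda r: r[1])

for rule, n_affected, rate in leaf_rules():
    print(f"{rule}  ->  affects {n_affected} txns, fraud rate {rate:.1%}")
```

Here the fraud rate is computed per transaction count; weighting by sales volume instead would need `sample_weight` at fit time or a per-leaf aggregation over the amounts.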
Luca, thanks for the information. I will look into party and R. But I prefer a Python package if possible.

Best,
Rex

On Fri, Aug 14, 2015 at 1:58 AM, Luca Puggini <lucapug...@gmail.com> wrote:

> I do not know if this may help you.
>
> I think that if you have to construct a single decision tree, it would be
> better to use something like
> https://cran.r-project.org/web/packages/party/party.pdf
>
> Here a statistical test is performed for each split, and this should make
> the model more robust and easier to interpret.
>
> I do not know if there is something similar in sklearn.
>
> Best,
> Luca
>
> On Fri, Aug 14, 2015 at 10:26 AM, Rex X <dnsr...@gmail.com> wrote:
>
>> The data sets are online transactions. Each one is labeled "fraud" or
>> "good", so this is a binary classification problem. With a decision tree
>> we can identify the combined conditions that are likely to trigger a
>> "fraud". I am willing to hear advice.
>>
>> The features include: transaction amount, time stamp, product_category,
>> risk_score, city, country, and fraud_flag.
>>
>> Most transactions are "good": say we have 1 million transactions in
>> total, and only 1 thousand are detected as "fraud".
>>
>> We want to find the optimal threshold values of "risk_score" for each of
>> the top compromised cities and/or product_categories, which are clusters
>> of fraud transactions. We want to minimize the fraud rate and maximize
>> the total sales volume.
>>
>> We are most interested in the decision rules leading to leaf nodes with
>> fraud_rate = fraud_sales / total_sales >= 20%.
>>
>> I am looking at DecisionTreeClassifier:
>> http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
>>
>> Because we want to extract rules, it is not feasible to build a
>> complicated decision tree, so I set max_depth=4.
>> What is the right strategy to set class_weight?
>>
>>> *class_weight* : dict, list of dicts, "auto" or None, optional
>>> (default=None)
>>>
>>> Weights associated with classes in the form {class_label: weight}...
>>> For *multi-output* problems, a list of dicts can be provided in the
>>> same order as the columns of y.
>>
>> I want each leaf node to report both
>>
>> [number of fraud, number of good transactions] and
>> [fraud sales volume, good sales volume].
>>
>> Should I use a list of dicts for class_weight? E.g.
>>
>> class_weight = [{0: 1, 1: 1},
>>                 {0: some_weight_to_be_figured_out, 1: other_weight}]
>>
>> Any tips are greatly welcome!
>>
>> Best regards,
>> Rex
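On the class_weight question: per the documentation quoted above, a list of dicts applies only to multi-output problems; for a single-output binary y, class_weight is one dict (or "balanced" in current scikit-learn). The per-leaf transaction counts and sales volumes can be recovered after fitting with `clf.apply`. A sketch under synthetic data — the weight 50 and the feature/variable names (`risk_score`, `amount`) are illustrative assumptions, not values from the thread:

```python
# Sketch: fit with a single class_weight dict for binary y, then use
# clf.apply() to bucket transactions by leaf and aggregate both the
# [fraud, good] counts and the [fraud, good] sales volumes per leaf.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(1)
n = 5000
risk_score = rng.uniform(0, 100, n)
amount = rng.uniform(1, 500, n)              # sales volume per transaction
y = ((risk_score > 85) & (rng.uniform(size=n) < 0.5)).astype(int)
X = np.column_stack([risk_score, amount])

# One dict for single-output y; up-weighting the rare fraud class is a
# tuning choice, not a prescribed value.
clf = DecisionTreeClassifier(max_depth=4, class_weight={0: 1, 1: 50},
                             random_state=0).fit(X, y)

leaf_ids = clf.apply(X)                      # leaf index for each transaction
for leaf in np.unique(leaf_ids):
    mask = leaf_ids == leaf
    n_fraud = int(y[mask].sum())
    n_good = int(mask.sum()) - n_fraud
    fraud_sales = amount[mask][y[mask] == 1].sum()
    good_sales = amount[mask][y[mask] == 0].sum()
    print(f"leaf {leaf}: [{n_fraud} fraud, {n_good} good] "
          f"[{fraud_sales:.0f} fraud sales, {good_sales:.0f} good sales]")
```

Note that class_weight changes where the splits land; the per-leaf sales aggregation above is independent of it and can also be driven by `sample_weight=amount` at fit time if splits should favor sales volume directly.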
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general