I am not sure whether this helps, but if you have to construct a single decision tree, I think it would be better to use something like the R package party:

https://cran.r-project.org/web/packages/party/party.pdf

There, a statistical test is performed for each split, which should make the model more robust and easier to interpret. I do not know whether there is something similar in sklearn.

Best,
Luca

On Fri, Aug 14, 2015 at 10:26 AM, Rex X <dnsr...@gmail.com> wrote:

> The data sets are online transactions. Each one is labeled as "fraud" or
> "good", so this is a binary classification problem. With a decision tree,
> we can identify the combined conditions that are likely to trigger a
> "fraud". I am willing to hear advice.
>
> The features include:
> transaction amount, time stamp, product_category, risk_score, city,
> country, and fraud_flag.
>
> Most transactions are "good": say we have 1 million transactions in
> total, and only 1 thousand are detected as "fraud".
>
> We want to find the optimal threshold values of "risk_score" for each of
> the top compromised cities and/or product_categories, which are clusters
> of fraud transactions. We want to minimize the fraud rate and maximize
> the total sales volume.
>
> We are most interested in the decision rules leading to clusters of leaf
> nodes with
>
> fraud rate = fraud_sales / total_sales >= 20%
>
> I am looking at DecisionTreeClassifier:
>
> http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
>
> Because we want to extract rules, it is not feasible to build a
> complicated decision tree. I set max_depth=4.
>
> What is the right strategy for setting class_weight?
>
>> class_weight : dict, list of dicts, “auto” or None, optional
>> (default=None)
>>
>> Weights associated with classes in the form {class_label: weight}... For
>> multi-output problems, a list of dicts can be provided in the same order
>> as the columns of y.
>
> For each leaf node, I want to output both
>
> [number of fraud, number of good transactions], and
> [fraud sales volume, good sales volume]
>
> Should I use a list of dicts for class_weight? e.g.
>
> class_weight=[{0: 1, 1: 1},
>               {0: some_weight_to_be_figured_out, 1: other_weight}]
>
> Any tips are greatly welcome!
>
> Best regards,
> Rex
>
> ------------------------------------------------------------------------------
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
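
On the class_weight question quoted above, here is a rough sketch rather than a definitive answer. As the quoted documentation says, the list-of-dicts form is meant for multi-output y, so it is probably not the way to get two sets of numbers per leaf. One option is to fit a shallow tree with a single class weighting and then compute the per-leaf counts and sales volumes afterwards from the leaf assignments. Everything about the data below is made up for illustration: the synthetic transactions, the column names, the hypothetical "city_is_3" indicator, and min_samples_leaf=50. It also assumes a scikit-learn release recent enough to accept class_weight="balanced" (the successor of the "auto" option quoted above) and to provide export_text.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the transaction data (all values are invented).
rng = np.random.default_rng(0)
n = 20000
risk_score = rng.uniform(0, 100, n)
amount = rng.gamma(2.0, 50.0, n)                        # sales amount per transaction
city_is_3 = (rng.integers(0, 5, n) == 3).astype(float)  # hypothetical one-hot city column

# Fraud is rare, and more likely for high risk_score in the "compromised" city.
p_fraud = 0.002 + 0.15 * ((risk_score > 80) & (city_is_3 == 1))
y = (rng.uniform(size=n) < p_fraud).astype(int)         # 1 = fraud, 0 = good

X = np.column_stack([risk_score, amount, city_is_3])
feature_names = ["risk_score", "amount", "city_is_3"]

# Shallow tree so the rules stay readable; "balanced" reweights the classes
# inversely to their frequencies. min_samples_leaf=50 is an arbitrary choice.
clf = DecisionTreeClassifier(
    max_depth=4, class_weight="balanced", min_samples_leaf=50, random_state=0
)
clf.fit(X, y)

# apply() returns the id of the leaf each sample ends up in.
leaf_of = clf.apply(X)

# Per-leaf counts and sales volumes, computed outside the tree itself.
leaf_stats = {}
for leaf in np.unique(leaf_of):
    in_leaf = leaf_of == leaf
    fraud = in_leaf & (y == 1)
    good = in_leaf & (y == 0)
    fraud_sales = float(amount[fraud].sum())
    good_sales = float(amount[good].sum())
    leaf_stats[int(leaf)] = {
        "n_fraud": int(fraud.sum()),
        "n_good": int(good.sum()),
        "fraud_sales": fraud_sales,
        "good_sales": good_sales,
        "fraud_rate_by_sales": fraud_sales / (fraud_sales + good_sales),
    }

# Human-readable decision rules for the whole tree.
rules = export_text(clf, feature_names=feature_names)
print(rules)

# Leaves whose sales-weighted fraud rate reaches the 20% threshold.
for leaf, s in leaf_stats.items():
    if s["fraud_rate_by_sales"] >= 0.20:
        print(leaf, s)
```

The class weighting only changes how the tree is grown; the statistics reported per leaf are computed on the unweighted data, so they can be read directly as counts and sales volumes.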