Re: [Scikit-learn-general] what is the right strategy to set up decisionTree for this fraud detection problem?

Luca Puggini Fri, 14 Aug 2015 02:00:07 -0700

I do not know if this may help you.

I think that if you have to build a single decision tree, it would be
better to use something like the R package party:
https://cran.r-project.org/web/packages/party/party.pdf

There, a statistical test is performed for each split, which should make
the model more robust and easier to interpret.
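
To illustrate the general idea of testing a split, here is a minimal
sketch in plain Python. It is not party's actual procedure, only a
Pearson chi-square test on the 2x2 table of (side of split) x (fraud /
good). The function name, the example counts, and the 0.05 threshold are
my own assumptions.

```python
import math


def chi2_split_test(left_fraud, left_good, right_fraud, right_good):
    """Pearson chi-square test of independence for a 2x2 split-vs-label table.

    Returns (statistic, p_value). With one degree of freedom the p-value
    has the closed form erfc(sqrt(statistic / 2)).
    """
    table = [[left_fraud, left_good], [right_fraud, right_good]]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        # Degenerate split: one side is empty or only one class is present.
        return 0.0, 1.0
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    return stat, math.erfc(math.sqrt(stat / 2.0))


# Hypothetical counts for a candidate split such as "risk_score > t",
# chosen to match 1 million transactions with 1 thousand frauds.
stat, p_value = chi2_split_test(
    left_fraud=100, left_good=900000, right_fraud=900, right_good=99000
)
# Keep the split only if the association is significant (assumed level 0.05).
accept_split = p_value < 0.05
```

A tree built this way stops splitting when no candidate split passes the
test, so the depth is driven by the data rather than by a fixed
max_depth.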

I do not know if there is something similar in sklearn.

Best,
Luca

On Fri, Aug 14, 2015 at 10:26 AM, Rex X <dnsr...@gmail.com> wrote:

> The data sets are online transactions. For each one, we label it as
> "fraud" or "good". This is a binary classification problem. With
> decisionTree, we can identify those combined conditions that are likely to
> trigger a "fraud". I am willing to hear advice.
>
> The features include:
> transaction amount, time stamp, product_category, risk_score, city,
> country, and fraud_flag.
>
> Most transactions are "good", say, we have 1 million transactions in
> total, and only 1 thousand are detected as "fraud".
>
> We want to find the optimal threshold values of "risk_score" for each
> of the top compromised cities and/or product_categories, which are
> clusters of fraud transactions. We want to minimize the fraud rate and
> maximize the total sales volume.
>
> We are most interested in finding the decision rules that lead to
> clusters of leaf nodes with
> fraud rate = fraud_sales / total_sales >= 20%
>
> I am looking at DecisionTreeClassifier:
>
> http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
>
> Because we want to extract rules, it is not feasible to build a
> complicated decision tree, so I set max_depth=4.
>
> What is the right strategy for setting class_weight?
>
>> class_weight : dict, list of dicts, “auto” or None, optional
>> (default=None)
>>
>> Weights associated with classes in the form {class_label: weight}... For
>> multi-output problems, a list of dicts can be provided in the same order
>> as the columns of y.
>>
> For each leaf node, I want to output both
>
> [number of fraud, number of good transactions], and [fraud sales volume,
> good sales volume]
>
> Should I use a list of dicts for class_weight? e.g.
>
> class_weight=[{0: 1, 1: 1}, {0: some_weight_to_be_figured_out,
> 1: other_weight}]
>
>
> Any tips are greatly welcome!
>
>
> Best regards,
> Rex
>
>
>
> ------------------------------------------------------------------------------
>
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>
