Meanwhile, among the decision rules found above that give fraud_rate = fraud_sales / total_sales >= 20%, we want to decline transactions using the rules that affect the fewest transactions, i.e. minimize the number of transactions declined.
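A minimal sketch of one way to do this with scikit-learn's fitted tree structure. The data, labels, and feature names below are synthetic stand-ins for the transaction features described in the thread, not the real data set; the traversal itself only uses documented `tree_` attributes (`children_left`, `children_right`, `feature`, `threshold`, `value`, `n_node_samples`):

```python
# Sketch: walk a fitted DecisionTreeClassifier, collect the rule leading to
# each leaf whose fraud rate meets a threshold, and rank the rules so those
# declining the fewest transactions come first.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
n = 10000
risk_score = rng.uniform(0, 100, n)
product_category = rng.randint(0, 10, n)
amount = rng.uniform(1, 500, n)
X = np.column_stack([risk_score, product_category, amount])
# Synthetic labels: fraud concentrated where risk_score is high.
y = ((risk_score > 90) & (rng.uniform(size=n) < 0.5)).astype(int)

clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
tree = clf.tree_
feature_names = ["risk_score", "product_category", "amount"]

def leaf_rules(min_fraud_rate=0.20):
    """Return (rule, n_transactions, fraud_rate) for each qualifying leaf,
    sorted so the rules affecting the fewest transactions come first."""
    rules, stack = [], [(0, [])]   # (node_id, list of condition strings)
    while stack:
        node, conds = stack.pop()
        if tree.children_left[node] == -1:       # -1 marks a leaf node
            good, fraud = tree.value[node][0]    # per-class totals at the leaf
            rate = fraud / (good + fraud)
            if rate >= min_fraud_rate:
                rules.append((" AND ".join(conds) or "(root)",
                              int(tree.n_node_samples[node]), float(rate)))
            continue
        name, thr = feature_names[tree.feature[node]], tree.threshold[node]
        stack.append((tree.children_left[node],  conds + [f"{name} <= {thr:.2f}"]))
        stack.append((tree.children_right[node], conds + [f"{name} > {thr:.2f}"]))
    return sorted(rules, key=lambda r: r[1])

for rule, n_affected, rate in leaf_rules():
    print(f"{rule}  ->  affects {n_affected} txns, fraud rate {rate:.1%}")
```

Here the fraud rate is computed per transaction count; weighting by sales volume instead would need `sample_weight` at fit time or a per-leaf aggregation over the amounts.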
Luca, thanks for the information. I will look into party and R. But I prefer a Python package if possible.

Best,
Rex

On Fri, Aug 14, 2015 at 1:58 AM, Luca Puggini <lucapug...@gmail.com> wrote:

> I do not know if this may help you.
>
> I think that if you have to construct a single decision tree, it would be
> better to use something like
> https://cran.r-project.org/web/packages/party/party.pdf
>
> Here a statistical test is performed for each split, and this should make
> the model more robust and easier to interpret.
>
> I do not know if there is something similar in sklearn.
>
> Best,
> Luca
>
> On Fri, Aug 14, 2015 at 10:26 AM, Rex X <dnsr...@gmail.com> wrote:
>
>> The data sets are online transactions. Each one is labeled "fraud" or
>> "good", so this is a binary classification problem. With a decision tree
>> we can identify the combined conditions that are likely to trigger a
>> "fraud". I am willing to hear advice.
>>
>> The features include: transaction amount, time stamp, product_category,
>> risk_score, city, country, and fraud_flag.
>>
>> Most transactions are "good": say we have 1 million transactions in
>> total, and only 1 thousand are detected as "fraud".
>>
>> We want to find the optimal threshold values of "risk_score" for each of
>> the top compromised cities and/or product_categories, which are clusters
>> of fraud transactions. We want to minimize the fraud rate and maximize
>> the total sales volume.
>>
>> We are most interested in the decision rules leading to leaf nodes with
>> fraud_rate = fraud_sales / total_sales >= 20%.
>>
>> I am looking at DecisionTreeClassifier:
>> http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
>>
>> Because we want to extract rules, it is not feasible to build a
>> complicated decision tree, so I set max_depth=4.
>> What is the right strategy to set class_weight?
>>
>>> *class_weight* : dict, list of dicts, "auto" or None, optional
>>> (default=None)
>>>
>>> Weights associated with classes in the form {class_label: weight}...
>>> For *multi-output* problems, a list of dicts can be provided in the
>>> same order as the columns of y.
>>
>> I want each leaf node to report both
>>
>> [number of fraud, number of good transactions] and
>> [fraud sales volume, good sales volume].
>>
>> Should I use a list of dicts for class_weight? E.g.
>>
>> class_weight = [{0: 1, 1: 1},
>>                 {0: some_weight_to_be_figured_out, 1: other_weight}]
>>
>> Any tips are greatly welcome!
>>
>> Best regards,
>> Rex
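On the class_weight question: per the documentation quoted above, a list of dicts applies only to multi-output problems; for a single-output binary y, class_weight is one dict (or "balanced" in current scikit-learn). The per-leaf transaction counts and sales volumes can be recovered after fitting with `clf.apply`. A sketch under synthetic data — the weight 50 and the feature/variable names (`risk_score`, `amount`) are illustrative assumptions, not values from the thread:

```python
# Sketch: fit with a single class_weight dict for binary y, then use
# clf.apply() to bucket transactions by leaf and aggregate both the
# [fraud, good] counts and the [fraud, good] sales volumes per leaf.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(1)
n = 5000
risk_score = rng.uniform(0, 100, n)
amount = rng.uniform(1, 500, n)              # sales volume per transaction
y = ((risk_score > 85) & (rng.uniform(size=n) < 0.5)).astype(int)
X = np.column_stack([risk_score, amount])

# One dict for single-output y; up-weighting the rare fraud class is a
# tuning choice, not a prescribed value.
clf = DecisionTreeClassifier(max_depth=4, class_weight={0: 1, 1: 50},
                             random_state=0).fit(X, y)

leaf_ids = clf.apply(X)                      # leaf index for each transaction
for leaf in np.unique(leaf_ids):
    mask = leaf_ids == leaf
    n_fraud = int(y[mask].sum())
    n_good = int(mask.sum()) - n_fraud
    fraud_sales = amount[mask][y[mask] == 1].sum()
    good_sales = amount[mask][y[mask] == 0].sum()
    print(f"leaf {leaf}: [{n_fraud} fraud, {n_good} good] "
          f"[{fraud_sales:.0f} fraud sales, {good_sales:.0f} good sales]")
```

Note that class_weight changes where the splits land; the per-leaf sales aggregation above is independent of it and can also be driven by `sample_weight=amount` at fit time if splits should favor sales volume directly.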
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general