[
https://issues.apache.org/jira/browse/SPARK-7685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
DB Tsai updated SPARK-7685:
---------------------------
Summary: Handle high imbalanced data and apply weights to different samples
in Logistic Regression (was: Handle high imbalanced data or apply weights to
different samples in Logistic Regression)
> Handle high imbalanced data and apply weights to different samples in
> Logistic Regression
> -----------------------------------------------------------------------------------------
>
> Key: SPARK-7685
> URL: https://issues.apache.org/jira/browse/SPARK-7685
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Reporter: DB Tsai
>
> In a fraud detection dataset, almost all of the samples are negative and only a
> few of them are positive. This kind of highly imbalanced data biases the model
> toward the negative class, resulting in poor performance. scikit-learn provides
> a correction that over-/undersamples the samples of each class according to
> given weights; in "auto" mode, it selects weights inversely proportional to the
> class frequencies in the training set. The same effect can be achieved more
> efficiently by multiplying the weights into the loss and gradient instead of
> actually over-/undersampling the training dataset, which is very expensive.
> http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
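> A minimal sketch of the idea above, assuming a simple dense-vector binary
> logistic loss (the names WeightedLogisticLoss, lossAndGradient, and classWeight
> are illustrative, not an existing Spark API): the per-class weight scales each
> sample's contribution to the loss and gradient, which has the same effect as
> replicating that sample, but without copying any data.
> {code:scala}
> object WeightedLogisticLoss {
>   // classWeight could be numSamples / (numClasses * classCount(label)),
>   // mirroring the "auto" mode described above.
>   def lossAndGradient(
>       coefficients: Array[Double],
>       features: Array[Double],
>       label: Double,        // 0.0 or 1.0
>       classWeight: Double): (Double, Array[Double]) = {
>     val margin = coefficients.zip(features).map { case (w, x) => w * x }.sum
>     val prob = 1.0 / (1.0 + math.exp(-margin))
>     // Scaling both the loss and the gradient by classWeight is equivalent to
>     // over-sampling this sample classWeight times.
>     val loss = -classWeight *
>       (label * math.log(prob) + (1.0 - label) * math.log(1.0 - prob))
>     val gradient = features.map(x => classWeight * (prob - label) * x)
>     (loss, gradient)
>   }
> }
> {code}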
> On the other hand, some of the training data may be more important than the
> rest; for example, training samples from tenured users may matter more than
> those from new users. We should be able to provide an additional
> "weight: Double" field in LabeledPoint so that individual samples can be
> weighted differently by the learning algorithm.
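> One possible shape for this, as a sketch only (WeightedLabeledPoint and the
> aggregation below are hypothetical, not existing Spark classes): a per-sample
> weight field that multiplies that sample's contribution to the objective,
> exactly as the class weight does above.
> {code:scala}
> case class WeightedLabeledPoint(
>     label: Double,
>     features: Array[Double],
>     weight: Double = 1.0)
>
> // The optimizer would aggregate weighted per-sample losses; the gradient of
> // each sample is scaled by the same weight.
> def totalLoss(data: Seq[WeightedLabeledPoint], coefficients: Array[Double]): Double =
>   data.map { p =>
>     val margin = coefficients.zip(p.features).map { case (w, x) => w * x }.sum
>     val prob = 1.0 / (1.0 + math.exp(-margin))
>     -p.weight * (p.label * math.log(prob) + (1.0 - p.label) * math.log(1.0 - prob))
>   }.sum
> {code}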
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]