[
https://issues.apache.org/jira/browse/SPARK-7685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14549667#comment-14549667
]
Joseph K. Bradley commented on SPARK-7685:
------------------------------------------
+1
This should probably be done in the Pipelines API, where we can add a "weight:
Double" column (HasWeight shared parameter) instead of modifying LabeledPoint.
LogisticRegression is a natural place to start supporting weights, but I hope
we can add weight support almost everywhere before too long.
> Handle high imbalanced data and apply weights to different samples in
> Logistic Regression
> -----------------------------------------------------------------------------------------
>
> Key: SPARK-7685
> URL: https://issues.apache.org/jira/browse/SPARK-7685
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Reporter: DB Tsai
>
> In a fraud detection dataset, almost all of the samples are negative while
> only a couple of them are positive. This kind of highly imbalanced data will
> bias the model toward the negative class, resulting in poor performance.
> scikit-learn provides a correction that lets users over-/under-sample the
> samples of each class according to given weights; in "auto" mode, it selects
> weights inversely proportional to the class frequencies in the training set.
> This can be done more efficiently by multiplying the weights into the loss
> and gradient instead of actually over-/under-sampling the training dataset,
> which is very expensive.
> http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
> On the other hand, some training samples may be more important than others:
> for example, samples from tenured users may matter more than samples from
> new users. We should be able to provide an additional "weight: Double" field
> in the LabeledPoint to weight them differently in the learning algorithm.
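As a rough illustration of the idea in the description (not Spark code; the
function names and the NumPy formulation are mine), the "auto" correction can
be expressed as per-sample weights inversely proportional to class frequency,
and those weights can then be multiplied directly into the logistic-loss
gradient rather than duplicating or dropping rows:

```python
import numpy as np

def class_weights(y):
    # "auto"-style weights: inversely proportional to class frequency,
    # so each class contributes equally to the weighted loss in total.
    classes, counts = np.unique(y, return_counts=True)
    w = len(y) / (len(classes) * counts)
    return dict(zip(classes, w))

def weighted_logistic_gradient(X, y, beta, sample_weight):
    # Gradient of sum_i w_i * logloss(y_i, sigmoid(x_i . beta)) w.r.t. beta,
    # for labels y in {0, 1}. Weighting here is O(n) extra work, versus
    # materializing an over-/under-sampled copy of the training set.
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return X.T @ (sample_weight * (p - y))
```

With weights chosen this way, the total weight of the positive and negative
classes is equal, which is what over-/under-sampling to balance would achieve,
without touching the data.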
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]