DB Tsai created SPARK-7685:
------------------------------
Summary: Handle highly imbalanced data or apply weights to different
samples in Logistic Regression
Key: SPARK-7685
URL: https://issues.apache.org/jira/browse/SPARK-7685
Project: Spark
Issue Type: New Feature
Components: ML
Reporter: DB Tsai
In a fraud detection dataset, almost all of the samples are negative and only
a couple of them are positive. This kind of highly imbalanced data biases the
model toward the negative class, resulting in poor performance. scikit-learn
provides a correction that lets users over-/under-sample the samples of each
class according to the given weights; in "auto" mode it selects weights
inversely proportional to the class frequencies in the training set. We can
achieve the same effect more efficiently by multiplying the weights into the
loss and gradient instead of actually over-/under-sampling the training
dataset, which is very expensive.
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
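For illustration, here is a minimal, self-contained sketch of folding per-sample
weights into the logistic loss and gradient; the names (WeightedPoint,
balancedClassWeights, lossAndGradient) are hypothetical and not existing MLlib APIs:

{code:scala}
// Hypothetical sketch (not Spark's actual implementation): scale each sample's
// contribution to the loss and gradient by its weight instead of physically
// over-/under-sampling the training data.

case class WeightedPoint(label: Double, weight: Double, features: Array[Double])

object WeightedLogisticLoss {

  // "auto"-style class weights: inversely proportional to class frequencies,
  // i.e. w_c = numSamples / (numClasses * count_c).
  def balancedClassWeights(labels: Seq[Double]): Map[Double, Double] = {
    val n = labels.size.toDouble
    val counts = labels.groupBy(identity).map { case (c, xs) => c -> xs.size.toDouble }
    counts.map { case (c, cnt) => c -> n / (counts.size * cnt) }
  }

  private def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  private def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Weighted negative log-likelihood and gradient: each sample's contribution
  // is scaled by its weight, which is equivalent to replicating the sample
  // weight-many times without touching the dataset.
  def lossAndGradient(
      data: Seq[WeightedPoint],
      coefficients: Array[Double]): (Double, Array[Double]) = {
    val grad = Array.fill(coefficients.length)(0.0)
    var loss = 0.0
    var weightSum = 0.0
    data.foreach { p =>
      val prob = sigmoid(dot(coefficients, p.features))
      loss -= p.weight * (p.label * math.log(prob) + (1.0 - p.label) * math.log(1.0 - prob))
      var i = 0
      while (i < grad.length) {
        grad(i) += p.weight * (prob - p.label) * p.features(i)
        i += 1
      }
      weightSum += p.weight
    }
    (loss / weightSum, grad.map(_ / weightSum))
  }
}
{code}

Scaling a sample's loss and gradient by its weight has the same effect as
duplicating that sample weight-many times, but without the cost of materializing
a resampled dataset.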
On the other hand, some training samples may be more important than others;
for example, samples from tenured users may matter more than samples from new
users. We should be able to provide an additional "weight: Double" field in
LabeledPoint so that samples can be weighted differently in the learning
algorithm.
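A rough sketch of what such a weighted point could look like; WeightedLabeledPoint
is a hypothetical name, shown spark-shell style:

{code:scala}
// Hypothetical sketch of the proposed API, not an existing Spark class:
// a LabeledPoint variant that also carries a per-sample weight.

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Defaulting the weight to 1.0 keeps existing (unweighted) behavior unchanged.
case class WeightedLabeledPoint(label: Double, weight: Double = 1.0, features: Vector)

// A sample from a tenured user counted three times as heavily as one from a new user.
val tenured = WeightedLabeledPoint(label = 1.0, weight = 3.0, features = Vectors.dense(0.2, 5.1))
val newUser = WeightedLabeledPoint(label = 0.0, weight = 1.0, features = Vectors.dense(0.7, 0.3))
{code}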