GitHub user dbtsai opened a pull request:

    https://github.com/apache/spark/pull/7884

    [SPARK-7685][ML] Apply weights to different samples in Logistic Regression

    In fraud detection datasets, almost all the samples are negative while only
a few are positive. This kind of highly imbalanced data biases the
model toward the negative class, resulting in poor performance. scikit-learn
provides a correction that lets users over-/undersample the samples of each
class according to the given weights; in auto mode, it selects weights inversely
proportional to class frequencies in the training set. The same effect can be
achieved more efficiently by multiplying the weights into the loss and gradient
instead of actually over-/undersampling the training dataset, which is very expensive.
    
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
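
As a rough illustration of the idea above (not the code in this pull request), the sketch below computes class weights inversely proportional to class frequencies and folds per-sample weights into the logistic loss and gradient. The names `class_weights` and `weighted_loss_and_gradient` are hypothetical, the normalization n / (k * count) is one common choice and is assumed here, and the model has no intercept or regularization to keep the example short.

```python
import math
from collections import Counter


def class_weights(labels):
    """Weights inversely proportional to class frequencies.

    Assumed normalization: n_samples / (n_classes * class_count).
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}


def weighted_loss_and_gradient(coef, xs, ys, ws):
    """Weighted logistic loss and gradient for labels in {0, 1}.

    Each sample's contribution is multiplied by its weight, so no
    rows have to be duplicated or dropped.
    """
    loss = 0.0
    grad = [0.0] * len(coef)
    for x, y, w in zip(xs, ys, ws):
        margin = sum(c * v for c, v in zip(coef, x))
        p = 1.0 / (1.0 + math.exp(-margin))  # predicted P(y = 1)
        loss += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
        for j, v in enumerate(x):
            grad[j] += w * (p - y) * v
    return loss, grad
```

With an integer weight of 3 on a sample, this gives the same loss and gradient as repeating that sample three times with weight 1, which is why no physical oversampling is needed.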
    On the other hand, some of the training data may be more important than
the rest: for example, training samples from tenured users may matter more than
training samples from new users. We should be able to provide an additional
"weight: Double" field in LabeledPoint so that samples can be weighted
differently in the learning algorithm.
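
A minimal sketch of what a per-sample weight could look like, written in Python for brevity rather than in Spark's Scala. `WeightedPoint` and `effective_weights` are hypothetical names, not part of Spark, and combining the per-sample weight with a class weight by multiplication is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class WeightedPoint:
    """Hypothetical stand-in for a LabeledPoint with an extra weight field."""
    label: float
    features: Sequence[float]
    weight: float = 1.0  # default keeps unweighted data behaving as before


def effective_weights(points: List[WeightedPoint],
                      class_weight: Dict[float, float]) -> List[float]:
    """Multiply each sample's own weight by the weight of its class (assumed rule)."""
    return [p.weight * class_weight.get(p.label, 1.0) for p in points]


points = [
    WeightedPoint(1.0, [0.3, 1.2], weight=2.0),  # e.g. a long-standing user
    WeightedPoint(0.0, [0.8, 0.1], weight=0.5),  # e.g. a new user
    WeightedPoint(0.0, [0.5, 0.4]),              # weight defaults to 1.0
]
weights = effective_weights(points, {1.0: 3.0, 0.0: 1.0})
```

The resulting weights would then multiply each sample's term in the loss and gradient. Defaulting the weight to 1.0 means existing unweighted data needs no changes.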


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dbtsai/spark SPARK-7685

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/7884.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #7884
    
----
commit e83b44464b85732aa83934922e23d7529c2f743e
Author: DB Tsai <[email protected]>
Date:   2015-08-03T04:37:24Z

    first commit

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
