GitHub user sethah opened a pull request:
https://github.com/apache/spark/pull/15721
[SPARK-17772][ML][TEST] Add test functions for ML sample weights
## What changes were proposed in this pull request?
More and more ML algos are accepting sample weights, and they have been
tested rather heterogeneously and with code duplication. This patch adds
extensible helper methods to `MLTestingUtils` that can be reused by various
algorithms accepting sample weights. Up to now, there seem to be a few tests
that have been implemented in common across suites:
* Check that oversampling is the same as giving the instances sample
weights proportional to the number of samples
* Check that outliers with tiny sample weights do not affect the
algorithm's performance
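
To make the two existing checks concrete, here is a minimal, self-contained sketch in plain Python. It is not the code from this patch and does not use Spark or `MLTestingUtils`: `fit_weighted_line` and `close` are hypothetical helpers, and the closed-form weighted least-squares fit stands in for any estimator that accepts sample weights.

```python
# Toy illustration only: a closed-form weighted least-squares line fit.
# `fit_weighted_line` and `close` are hypothetical names, not Spark APIs.

def fit_weighted_line(xs, ys, ws):
    """Return (slope, intercept) minimizing sum_i w_i * (y_i - a*x_i - b)^2."""
    total = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / total
    y_bar = sum(w * y for w, y in zip(ws, ys)) / total
    cov = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    var = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    slope = cov / var
    return slope, y_bar - slope * x_bar


def close(a, b, tol=1e-6):
    """True when two coefficient tuples agree element-wise within tol."""
    return all(abs(p - q) <= tol for p, q in zip(a, b))


xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 0.9, 2.2, 2.8]

# Check 1: repeating instance i counts[i] times with unit weights should
# match fitting each instance once with weight counts[i].
counts = [1, 3, 2, 4]
over_x = [x for x, c in zip(xs, counts) for _ in range(c)]
over_y = [y for y, c in zip(ys, counts) for _ in range(c)]
oversampled = fit_weighted_line(over_x, over_y, [1.0] * len(over_x))
weighted = fit_weighted_line(xs, ys, [float(c) for c in counts])

# Check 2: gross outliers with tiny weights should barely move the fit.
clean = fit_weighted_line(xs, ys, [1.0] * len(xs))
with_outliers = fit_weighted_line(
    xs + [1.5, 2.5],
    ys + [100.0, -100.0],
    [1.0] * len(xs) + [1e-9, 1e-9],
)

print(close(oversampled, weighted), close(clean, with_outliers))
```

In a real suite the comparison would be made on the fitted model's coefficients or predictions, with a tolerance chosen per algorithm.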
This patch adds an additional test:
* Check that algorithms are invariant to constant scaling of the sample
weights, i.e. uniform sample weights with `w_i = 1.0` are effectively the same
as uniform sample weights with `w_i = 10000` or `w_i = 0.0001`
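
The scaling check can be sketched the same way. The example below is an illustration under stated assumptions, not the helper added by this patch: `weighted_stats` is a hypothetical function that computes weighted class priors and per-class feature means (the kind of sufficient statistics a naive-Bayes-style model relies on), written in plain Python without Spark.

```python
# Toy illustration only: weighted class priors and per-class feature means.
# `weighted_stats` and `max_deviation` are hypothetical names, not Spark APIs.

def weighted_stats(labels, values, weights):
    """Return (priors, means) as dicts keyed by class label."""
    total = sum(weights)
    priors, means = {}, {}
    for c in sorted(set(labels)):
        w_c = sum(w for l, w in zip(labels, weights) if l == c)
        priors[c] = w_c / total
        means[c] = sum(
            w * v for l, v, w in zip(labels, values, weights) if l == c
        ) / w_c
    return priors, means


def max_deviation(stats_a, stats_b):
    """Largest absolute difference between two (priors, means) results."""
    return max(
        abs(a[c] - b[c])
        for a, b in zip(stats_a, stats_b)
        for c in a
    )


labels = [0, 0, 1, 1, 1]
values = [1.0, 2.0, 4.0, 5.0, 6.0]

# Baseline: uniform weights w_i = 1.0.
baseline = weighted_stats(labels, values, [1.0] * len(labels))

# Rescaled uniform weights should give effectively the same estimates.
deviations = {
    scale: max_deviation(
        baseline, weighted_stats(labels, values, [scale] * len(labels))
    )
    for scale in (10000.0, 0.0001)
}

print(all(d < 1e-9 for d in deviations.values()))
```

The estimates only agree up to floating-point error, so a tolerance-based comparison is the appropriate assertion rather than exact equality.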
Instances of these tests existed in the LinearRegression, NaiveBayes, and
LogisticRegression suites. Those tests have been removed or modified to use
the new helper methods. These helper functions will be of use when
[SPARK-9478](https://issues.apache.org/jira/browse/SPARK-9478) is implemented.
## How was this patch tested?
This patch only involves modifying test suites.
## Other notes
Both IsotonicRegression and GeneralizedLinearRegression also extend
`HasWeightCol`. I did not modify these test suites because leaving them out
makes this patch easier to review, and because they did not duplicate the same tests as
the three suites that were modified. If we want to change them later, we can
create a JIRA for it now, but it's open for debate.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sethah/spark SPARK-17772
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15721.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15721
----
commit e10be455ee943230a96e57370b718683647e6f03
Author: sethah <[email protected]>
Date: 2016-10-18T21:27:02Z
add sample weight helper tests
----