GitHub user imatiach-msft opened a pull request:
https://github.com/apache/spark/pull/21632
[SPARK-19591][ML][MLlib] Add sample weights to decision trees
This is an updated version of PR https://github.com/apache/spark/pull/16722
What changes were proposed in this pull request?
This patch adds support for sample weights to DecisionTreeRegressor and
DecisionTreeClassifier.
Note: This patch does not add support for sample weights to RandomForest.
As discussed in the JIRA, we would like to incorporate sample weights into the
bagging process. This patch is large enough as it is, and there are additional
considerations to be made for random forests. Since the machinery introduced
here needs to be present regardless, I have opted to leave random forests for a
follow-up PR.
How was this patch tested?
The algorithms are tested to ensure that:
- Arbitrary scaling of constant weights has no effect.
- Outliers with small weights do not affect the learned model.
- Oversampling and weighting are equivalent (see the sketch after this list).
Unit tests are also added to test other, smaller components.
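For illustration, here is a minimal sketch of the oversampling/weighting
equivalence check; it assumes a SparkSession named spark is in scope, and the
data and column names are made up for the example rather than taken from the
PR's actual test suite:

    import org.apache.spark.ml.classification.DecisionTreeClassifier
    import org.apache.spark.ml.linalg.Vectors
    import spark.implicits._

    // (label, weight, features)
    val base = Seq(
      (0.0, 1.0, Vectors.dense(0.0)),
      (1.0, 1.0, Vectors.dense(1.0)))

    // Oversampled: the positive point repeated three times at weight 1.
    val oversampled = (base ++ Seq.fill(2)((1.0, 1.0, Vectors.dense(1.0))))
      .toDF("label", "weight", "features")
    // Weighted: the positive point once with weight 3.
    val weighted = base.updated(1, (1.0, 3.0, Vectors.dense(1.0)))
      .toDF("label", "weight", "features")

    val dt = new DecisionTreeClassifier().setWeightCol("weight")
    // The two fitted trees should be structurally identical.
    assert(dt.fit(oversampled).toDebugString == dt.fit(weighted).toDebugString)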
Summary of changes
Impurity aggregators now store weighted sufficient statistics. They
also store a raw (unweighted) count, since that count is needed to
enforce minInstancesPerNode.
This patch preserves the meaning of minInstancesPerNode: the parameter
still refers to raw, unweighted counts. It also adds a new parameter,
minWeightFractionPerNode, which requires that each node contain at
least minWeightFractionPerNode * weightedNumExamples total weight. How the
two criteria combine is sketched below.
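To make the interaction concrete, here is a minimal sketch of a
split-validity check under both criteria; the function and parameter names
are illustrative, not the patch's actual internals:

    // Illustrative check: a candidate child node must satisfy both the
    // raw-count minimum and the weighted-count minimum.
    def isSplitValid(
        rawCount: Long,           // unweighted number of instances in the node
        weightedCount: Double,    // sum of sample weights in the node
        minInstancesPerNode: Int,
        minWeightFractionPerNode: Double,
        weightedNumExamples: Double): Boolean = {
      rawCount >= minInstancesPerNode &&
        weightedCount >= minWeightFractionPerNode * weightedNumExamples
    }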
This patch modifies findSplitsForContinuousFeatures to use weighted
sums, as sketched below. Unit tests are added.
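As a simplified sketch of the idea (not the patch's exact algorithm): split
thresholds for a continuous feature are chosen from the distinct values so
that each bin receives roughly equal total weight rather than an equal raw
count:

    // Given total weight per distinct feature value, choose up to numSplits
    // thresholds so that bins receive roughly equal total weight.
    def findSplitsSketch(valueWeights: Map[Double, Double], numSplits: Int): Array[Double] = {
      val sorted = valueWeights.toArray.sortBy(_._1)
      val stride = sorted.map(_._2).sum / (numSplits + 1)
      val splits = scala.collection.mutable.ArrayBuffer.empty[Double]
      var cumWeight = 0.0
      var target = stride
      // Never split after the last value: there would be no right-hand bin.
      for (((value, w), i) <- sorted.zipWithIndex if i < sorted.length - 1) {
        cumWeight += w
        if (splits.size < numSplits && cumWeight >= target) {
          splits += value // threshold falls between this value and the next
          target += stride
        }
      }
      splits.toArray
    }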
TreePoint is modified to hold a sample weight.
BaggedPoint is modified from:

    private[spark] class BaggedPoint[Datum](
        val datum: Datum,
        val subsampleWeights: Array[Double]) extends Serializable

to:

    private[spark] class BaggedPoint[Datum](
        val datum: Datum,
        val subsampleCounts: Array[Int],
        val sampleWeight: Double) extends Serializable
We do not simply multiply the counts by the weight and store the
product, because we need both the raw counts and the weight in order to
enforce minInstancesPerNode and minWeightPerNode; a hypothetical
aggregation helper below illustrates this.
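The helper (not code from the patch) accumulates the raw count and the
weighted count for a node side by side, using the BaggedPoint class shown
above:

    // For one tree in the ensemble, accumulate the raw instance count and
    // the total sample weight of the points reaching a node.
    def accumulate[Datum](
        points: Iterator[BaggedPoint[Datum]],
        treeIndex: Int): (Long, Double) = {
      points.foldLeft((0L, 0.0)) { case ((count, weight), bp) =>
        val c = bp.subsampleCounts(treeIndex)
        (count + c, weight + c * bp.sampleWeight)
      }
    }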
Note: many of the changed files differ only because they use Instance
instead of LabeledPoint.
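For reference, the two types differ only in the added weight field (shapes
shown schematically; in Spark, Instance is internal to the ml package):

    // Without a weight:
    case class LabeledPoint(label: Double, features: Vector)
    // With a per-instance weight:
    case class Instance(label: Double, weight: Double, features: Vector)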
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/imatiach-msft/spark ilmat/sample-weights
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21632.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21632
----
commit b5278e5a54156c14b6a8bdd3256f18e1ff3b4128
Author: Ilya Matiach <ilmat@...>
Date: 2017-01-27T16:38:36Z
add weights to dt
----