GitHub user imatiach-msft opened a pull request:

    https://github.com/apache/spark/pull/21632

    [SPARK-19591][ML][MLlib] Add sample weights to decision trees

    This is an updated version of PR https://github.com/apache/spark/pull/16722.
    
    What changes were proposed in this pull request?
    
    This patch adds support for sample weights to DecisionTreeRegressor and 
DecisionTreeClassifier.
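
    With this change the weight column is set through the usual Params API. A minimal usage sketch (the DataFrame df and the column names are placeholders, not part of the patch):

    import org.apache.spark.ml.classification.DecisionTreeClassifier

    // df is assumed to have "label", "features" and a per-row "weight" column.
    val dt = new DecisionTreeClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setWeightCol("weight")  // the new sample-weight parameter
      .setMaxDepth(5)
    val model = dt.fit(df)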
    
    Note: This patch does not add support for sample weights to RandomForest. 
As discussed in the JIRA, we would like to add sample weights into the bagging 
process. This patch is large enough as is, and there are some additional 
considerations to be made for random forests. Since the machinery introduced 
here needs to be present regardless, I have opted to leave random forests for a 
follow-up PR.

    How was this patch tested?
    
    The algorithms are tested to ensure that:
    
        Arbitrary scaling of constant weights has no effect
        Outliers with small weights do not affect the learned model
        Oversampling and weighting are equivalent
    
    Unit tests are also added to test other smaller components.
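
    As a rough sketch of the weight-scaling check (not the actual test code; the data set and helper below are made up), training with any constant weight should produce the same tree:

    import org.apache.spark.ml.classification.DecisionTreeClassifier
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.lit

    // data is an assumed DataFrame with "label" and "features" columns.
    def fitWithConstantWeight(data: DataFrame, w: Double) =
      new DecisionTreeClassifier()
        .setWeightCol("w")
        .fit(data.withColumn("w", lit(w)))

    // Scaling every weight by the same constant should not change the tree.
    assert(fitWithConstantWeight(data, 1.0).toDebugString ==
      fitWithConstantWeight(data, 37.5).toDebugString)
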
    Summary of changes
    
        Impurity aggregators now store weighted sufficient statistics. They also store the raw (unweighted) count, since that is needed to enforce minInstancesPerNode.
    
        This patch preserves the meaning of minInstancesPerNode: the parameter still refers to raw, unweighted counts. It also adds a new parameter, minWeightFractionPerNode, which requires that each node contain at least minWeightFractionPerNode * weightedNumExamples total weight.
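
    As an illustration only (the real logic lives in the tree training internals), the two constraints combine roughly as follows, where minWeightPerNode = minWeightFractionPerNode * weightedNumExamples; the helper name is hypothetical:

    // Hypothetical helper, for illustration only.
    def splitIsValid(
        leftCount: Long, rightCount: Long,        // raw, unweighted counts
        leftWeight: Double, rightWeight: Double,  // total sample weight per side
        minInstancesPerNode: Int, minWeightPerNode: Double): Boolean = {
      leftCount >= minInstancesPerNode && rightCount >= minInstancesPerNode &&
        leftWeight >= minWeightPerNode && rightWeight >= minWeightPerNode
    }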
    
        This patch modifies findSplitsForContinuousFeatures to use weighted 
sums. Unit tests are added.
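
    In a much-simplified sketch (not the actual implementation; the helper below is hypothetical), the idea is to walk cumulative weight rather than cumulative row count when picking candidate thresholds:

    // Simplified sketch: choose up to numSplits thresholds so that roughly equal
    // total *weight* (not row count) falls between consecutive candidates.
    def weightedSplits(values: Seq[Double], weights: Seq[Double], numSplits: Int): Seq[Double] = {
      val sorted = values.zip(weights).sortBy(_._1)
      val stride = weights.sum / (numSplits + 1)
      var cumWeight = 0.0
      var nextTarget = stride
      val splits = scala.collection.mutable.ArrayBuffer.empty[Double]
      for ((v, w) <- sorted) {
        cumWeight += w
        if (cumWeight >= nextTarget && splits.size < numSplits) {
          splits += v
          nextTarget += stride
        }
      }
      splits.toSeq
    }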
    
        TreePoint is modified to hold a sample weight.
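
    Roughly (a sketch; the exact field names and order in the patch may differ):

    private[spark] class TreePoint(
        val label: Double,
        val binnedFeatures: Array[Int],
        val weight: Double) extends Serializable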
    
        BaggedPoint is modified from:
    
    private[spark] class BaggedPoint[Datum](val datum: Datum, val subsampleWeights: Array[Double]) extends Serializable
    
    to
    
    private[spark] class BaggedPoint[Datum](
        val datum: Datum,
        val subsampleCounts: Array[Int],
        val sampleWeight: Double) extends Serializable
    
    We do not simply multiply the counts by the weight and store the product, because both the raw counts and the weight are needed: the former to enforce minInstancesPerNode, and the latter to enforce the per-node minimum weight derived from minWeightFractionPerNode.
    
        Note: many of the changed files are touched simply because they now use Instance instead of LabeledPoint.
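
    For reference, Instance is the existing case class in org.apache.spark.ml.feature that carries a per-row weight alongside the label and features:

    private[spark] case class Instance(label: Double, weight: Double, features: Vector)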

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/imatiach-msft/spark ilmat/sample-weights

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21632.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21632
    
----
commit b5278e5a54156c14b6a8bdd3256f18e1ff3b4128
Author: Ilya Matiach <ilmat@...>
Date:   2017-01-27T16:38:36Z

    add weights to dt

----

