[GitHub] imatiach-msft opened a new pull request #23682: [SPARK-19591][ML][MLlib][FOLLOWUP] Add sample weights to decision trees - fix tolerance

GitBox Mon, 28 Jan 2019 20:32:38 -0800

imatiach-msft opened a new pull request #23682: 
[SPARK-19591][ML][MLlib][FOLLOWUP] Add sample weights to decision trees - fix 
tolerance
URL: https://github.com/apache/spark/pull/23682
 
 
   This is a follow-up to PR:
   https://github.com/apache/spark/pull/21632
   
   ## What changes were proposed in this pull request?
   
   This PR tunes the tolerance used to decide whether to add zero feature
   values to a value-count map (where the key is a feature value and the value
   is the weighted count of samples with that feature value).
   In the previous PR the tolerance scaled with the square of the unweighted
   number of samples, which is too aggressive when the number of unweighted
   samples is large. Unfortunately, using just
   "Utils.EPSILON * unweightedNumSamples" is not enough either, so I multiplied
   that by a factor tuned using the testing procedure below.
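
To make the scaling difference concrete, here is a minimal sketch of the two tolerance shapes described above. It is an illustration, not the code in this PR: `Epsilon` is a local stand-in for `Utils.EPSILON` (assumed here to be double-precision machine epsilon), and `factor` is a hypothetical parameter standing in for the tuned constant, whose value is not given in this description.

```scala
object ToleranceSketch {
  // Stand-in for Utils.EPSILON (assumption): the gap between 1.0 and the
  // next representable Double.
  val Epsilon: Double = math.ulp(1.0)

  // Shape described for the previous PR: grows with the square of the
  // unweighted sample count.
  def quadraticTolerance(unweightedNumSamples: Long): Double =
    Epsilon * unweightedNumSamples * unweightedNumSamples

  // Shape described for this PR: grows linearly with the unweighted sample
  // count, multiplied by a tuned factor (`factor` is a hypothetical name).
  def linearTolerance(unweightedNumSamples: Long, factor: Double): Double =
    Epsilon * unweightedNumSamples * factor
}
```

For any sample count larger than `factor`, the quadratic form yields the larger tolerance, and the gap widens as the sample count grows.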
   
   ## How was this patch tested?
   
   I manually ran the sample weight tests for the decision tree regressor to
   check whether the tolerance was large enough to exclude zero feature
   values.
   
   E.g., in SBT:
   ./build/sbt
   > project mllib
   > testOnly *DecisionTreeRegressorSuite -- -z "training with sample weights"
   
   For validation, I added a print inside the if branch in the code below and
   confirmed that the tolerance was large enough that we do not add zero
   feature values (which don't exist in that test):
         val valueCountMap = if (weightedNumSamples - partNumSamples > tolerance) {
           print("should not print this")
           partValueCountMap + (0.0 -> (weightedNumSamples - partNumSamples))
         } else {
           partValueCountMap
         }
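
The check above can be tried in isolation with a small self-contained sketch. The names `ValueCountSketch` and `withZeroCount` are hypothetical, and computing `partNumSamples` as the sum of the map's weighted counts is an assumption made for illustration; the tolerance is passed in rather than derived.

```scala
object ValueCountSketch {
  // Adds an entry for feature value 0.0 only when the weight not covered by
  // the explicit feature values exceeds the tolerance.
  def withZeroCount(
      partValueCountMap: Map[Double, Double],
      weightedNumSamples: Double,
      tolerance: Double): Map[Double, Double] = {
    // Assumption: the weight seen so far is the sum of the map's counts.
    val partNumSamples = partValueCountMap.values.sum
    val missingWeight = weightedNumSamples - partNumSamples
    if (missingWeight > tolerance) {
      partValueCountMap + (0.0 -> missingWeight)
    } else {
      partValueCountMap
    }
  }
}
```

With counts `Map(1.0 -> 0.7, 2.0 -> 0.1)`, a total weight of 0.8 and a tolerance of 1e-9, any difference comes only from floating-point rounding, so the map is returned unchanged. With a total weight of 1.0, roughly 0.2 of weight is unaccounted for and a `0.0` entry is added.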


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

