Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/14949
The original JIRA
[SPARK-8069](https://issues.apache.org/jira/browse/SPARK-8069) refers to
https://cran.r-project.org/web/packages/randomForest/randomForest.pdf.
That R package calls it "cutoff", though it does indeed seem to act more
like a "weight" or "scaling". I can't say I've come across it before, and it
appears this is the only package that does it like this (at least that I've
been able to find from some quick searching). I haven't found any theoretical
background for it either.
In any case, now that we have it, I think it's probably best to keep it as
is. However, it appears that our implementation here is flawed: in the
original R code, the `cutoff` vector must sum to a value in (0, 1) and every
entry must be > 0 - see
https://github.com/cran/randomForest/blob/9208176df98d561aba6dae239472be8b124e2631/R/predict.randomForest.R#L47.
If we're going to base something on another impl, it's probably best to
actually follow it.
So:
* If `sum(thresholds)` > 1 or < 0, throw an error
* If any entry in `thresholds` is not > 0, throw an error
I believe this takes care of the edge cases, since no threshold can be `0`
or `1`. Tie-breaking is handled by `Vector.argmax`: if p/t (probability
divided by threshold) is the same for 2 or more classes, ties will
effectively be broken by class index order.
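For reference, a minimal sketch of the validation proposed above, assuming
a plain `thresholds: Array[Double]` with one entry per class (the helper
name is hypothetical, not the actual PR code):

```scala
// Hypothetical helper mirroring the two rules above: the sum of the
// thresholds must not be > 1 or < 0, and every entry must be strictly > 0.
def validateThresholds(thresholds: Array[Double]): Unit = {
  val sum = thresholds.sum
  require(sum >= 0.0 && sum <= 1.0,
    s"sum(thresholds) must not be > 1 or < 0, but was $sum")
  require(thresholds.forall(_ > 0.0),
    s"All thresholds must be > 0, but got ${thresholds.mkString("[", ", ", "]")}")
}
```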
I don't like returning `NaN`. Since the R impl is really scaling things
rather than "cutting off" or "thresholding", it always returns a prediction,
and I think we should too.
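To illustrate the scaling behaviour, here is a rough sketch (again, not the
actual PR code) of prediction with thresholds: each class probability is
divided by its threshold and the largest result wins, so some class is
always predicted and ties fall to the lowest class index:

```scala
// Rough sketch: scale each class probability by 1 / threshold and take the
// argmax. Because every threshold is > 0, this always yields a class index
// (never NaN), and ties go to the lowest index, as with Vector.argmax.
def predictWithThresholds(probabilities: Array[Double], thresholds: Array[Double]): Int = {
  val scaled = probabilities.zip(thresholds).map { case (p, t) => p / t }
  var best = 0
  var i = 1
  while (i < scaled.length) {
    if (scaled(i) > scaled(best)) best = i
    i += 1
  }
  best
}
```

For example, `predictWithThresholds(Array(0.2, 0.5, 0.3), Array(0.2, 0.4, 0.1))`
returns class 2, since 0.3 / 0.1 = 3.0 is the largest scaled value.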