[ 
https://issues.apache.org/jira/browse/SPARK-17057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15502929#comment-15502929
 ] 

Apache Spark commented on SPARK-17057:
--------------------------------------

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/15149

> ProbabilisticClassifierModels' thresholds should be > 0 and sum < 1 to match 
> randomForest cutoff
> ------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-17057
>                 URL: https://issues.apache.org/jira/browse/SPARK-17057
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.0.0
>            Reporter: zhengruifeng
>            Assignee: Sean Owen
>            Priority: Minor
>
> {code}
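> import org.apache.spark.ml.classification.RandomForestClassifier
> val path = "./data/mllib/sample_multiclass_classification_data.txt"
> val data = spark.read.format("libsvm").load(path)
> // rf was not defined in the original report; a default classifier is assumed here
> val rf = new RandomForestClassifier()
> val rfm = rf.fit(data)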
> scala> rfm.setThresholds(Array(0.0,0.0,0.0))
> res4: org.apache.spark.ml.classification.RandomForestClassificationModel = 
> RandomForestClassificationModel (uid=rfc_cbe640b0eccc) with 20 trees
> scala> rfm.transform(data).show(5)
> +-----+--------------------+--------------+-------------+----------+
> |label|            features| rawPrediction|  probability|prediction|
> +-----+--------------------+--------------+-------------+----------+
> |  1.0|(4,[0,1,2,3],[-0....|[0.0,20.0,0.0]|[0.0,1.0,0.0]|       0.0|
> |  1.0|(4,[0,1,2,3],[-0....|[0.0,20.0,0.0]|[0.0,1.0,0.0]|       0.0|
> |  1.0|(4,[0,1,2,3],[-0....|[0.0,20.0,0.0]|[0.0,1.0,0.0]|       0.0|
> |  1.0|(4,[0,1,2,3],[-0....|[0.0,20.0,0.0]|[0.0,1.0,0.0]|       0.0|
> |  0.0|(4,[0,1,2,3],[0.1...|[20.0,0.0,0.0]|[1.0,0.0,0.0]|       0.0|
> +-----+--------------------+--------------+-------------+----------+
> only showing top 5 rows
> {code}
> If multiple thresholds are set to zero, the prediction of 
> {{ProbabilisticClassificationModel}} is the first index whose corresponding 
> threshold is 0. In the output above, rows with {{probability}} [0.0,1.0,0.0] 
> are therefore still predicted as 0.0. 
> In this case, it would be more reasonable for {{prediction}} to be the index 
> with the maximum {{probability}} among the indices whose threshold is 0.
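
The tie-breaking rule suggested in the description can be sketched in plain Scala as follows. This is only an illustration of the proposal, not Spark's implementation and not necessarily what the linked pull request does: `ThresholdSketch` and `predict` are hypothetical names, and falling back to the largest probability/threshold ratio when no threshold is zero is an assumption made for the sketch.

```scala
// Hypothetical helper, not part of Spark's API: it illustrates the rule
// proposed in the report for choosing a prediction under zero thresholds.
object ThresholdSketch {
  def predict(probability: Array[Double], thresholds: Array[Double]): Int = {
    require(probability.length == thresholds.length,
      "probability and thresholds must have the same length")
    require(thresholds.forall(_ >= 0.0), "thresholds must be non-negative")

    // Indices of the classes whose threshold is exactly zero.
    val zeroIdx = thresholds.indices.filter(i => thresholds(i) == 0.0)

    if (zeroIdx.nonEmpty) {
      // Proposed behavior: among the zero-threshold classes, pick the one
      // with the highest probability instead of the first such index.
      zeroIdx.maxBy(i => probability(i))
    } else {
      // Assumed fallback for this sketch: largest probability/threshold ratio.
      probability.indices.maxBy(i => probability(i) / thresholds(i))
    }
  }
}

// With the probabilities from the rows shown above and all thresholds at zero,
// the sketch selects class 1 rather than class 0.
val predicted = ThresholdSketch.predict(Array(0.0, 1.0, 0.0), Array(0.0, 0.0, 0.0))
```

Restricting the comparison to zero-threshold classes avoids dividing by zero while still letting the probabilities decide among them.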



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

