zhengruifeng created SPARK-17057:
------------------------------------

             Summary: ProbabilisticClassifierModels' prediction more reasonable 
with multi zero thresholds
                 Key: SPARK-17057
                 URL: https://issues.apache.org/jira/browse/SPARK-17057
             Project: Spark
          Issue Type: Improvement
          Components: ML
            Reporter: zhengruifeng


{code}
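import org.apache.spark.ml.classification.RandomForestClassifier

val path = "./data/mllib/sample_multiclass_classification_data.txt"
val data = spark.read.format("libsvm").load(path)
// rf was not defined in the original snippet; a classifier with default params is assumed
val rf = new RandomForestClassifier()
val rfm = rf.fit(data)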

scala> rfm.setThresholds(Array(0.0,0.0,0.0))
res4: org.apache.spark.ml.classification.RandomForestClassificationModel = RandomForestClassificationModel (uid=rfc_cbe640b0eccc) with 20 trees

scala> rfm.transform(data).show(5)
+-----+--------------------+--------------+-------------+----------+
|label|            features| rawPrediction|  probability|prediction|
+-----+--------------------+--------------+-------------+----------+
|  1.0|(4,[0,1,2,3],[-0....|[0.0,20.0,0.0]|[0.0,1.0,0.0]|       0.0|
|  1.0|(4,[0,1,2,3],[-0....|[0.0,20.0,0.0]|[0.0,1.0,0.0]|       0.0|
|  1.0|(4,[0,1,2,3],[-0....|[0.0,20.0,0.0]|[0.0,1.0,0.0]|       0.0|
|  1.0|(4,[0,1,2,3],[-0....|[0.0,20.0,0.0]|[0.0,1.0,0.0]|       0.0|
|  0.0|(4,[0,1,2,3],[0.1...|[20.0,0.0,0.0]|[1.0,0.0,0.0]|       0.0|
+-----+--------------------+--------------+-------------+----------+
only showing top 5 rows
{code}

If multiple thresholds are set to zero, {{ProbabilisticClassificationModel}} 
predicts the first index whose threshold is 0, regardless of the probabilities. 
In the output above, the first four rows have {{probability}} [0.0,1.0,0.0] 
but {{prediction}} 0.0. 
In this case it would be more reasonable to predict the index with the highest 
{{probability}} among the indices whose threshold is 0.
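
To make the difference concrete, here is a minimal sketch in plain Scala (no Spark dependency). It is not the Spark source: the object name {{ZeroThresholdSketch}} and both method names are hypothetical, and the first method only models the behavior reported in this issue, under the assumption that a zero threshold is treated as an infinite scaled score. The second method illustrates the proposed rule.

```scala
object ZeroThresholdSketch {

  // Models the reported behavior (assumption): each probability is scaled by
  // 1/threshold, a zero threshold yields +Infinity, and the first maximum wins.
  def predictFirstZero(probability: Array[Double], thresholds: Array[Double]): Int = {
    val scaled = probability.zip(thresholds).map { case (p, t) =>
      if (t == 0.0) Double.PositiveInfinity else p / t
    }
    scaled.indexOf(scaled.max)
  }

  // Proposed rule: if any thresholds are zero, pick the index with the highest
  // probability among those indices; otherwise use the usual scaled argmax.
  def predictMaxAmongZero(probability: Array[Double], thresholds: Array[Double]): Int = {
    val zeroIdx = thresholds.indices.filter(i => thresholds(i) == 0.0)
    if (zeroIdx.nonEmpty) {
      zeroIdx.maxBy(i => probability(i))
    } else {
      val scaled = probability.zip(thresholds).map { case (p, t) => p / t }
      scaled.indexOf(scaled.max)
    }
  }
}
```

With the probability vector from the first row of the table, {{[0.0,1.0,0.0]}}, and thresholds {{[0.0,0.0,0.0]}}, the first method returns index 0 (matching the reported {{prediction}} of 0.0), while the second returns index 1.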


