[
https://issues.apache.org/jira/browse/SPARK-45910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17863526#comment-17863526
]
psyren99 commented on SPARK-45910:
----------------------------------
I would like to take on this issue if no one else is working on it.
> Numerical output of MulticlassClassificationEvaluator does not coincide with
> expected output
> --------------------------------------------------------------------------------------------
>
> Key: SPARK-45910
> URL: https://issues.apache.org/jira/browse/SPARK-45910
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 3.4.1, 3.5.0
> Reporter: Alex Wozniakowski
> Priority: Critical
> Attachments: predictions_dot_show.png
>
>
> The following code shows an example in which
> MulticlassClassificationEvaluator generates numerical output that does not
> coincide with the expected output:
> {code:python}
> from pyspark.ml.classification import LinearSVC
> from pyspark.ml.feature import VectorAssembler
> from pyspark.ml.evaluation import MulticlassClassificationEvaluator
> train_data = [(0, 1.0, 2.0, 3.0), (1, 4.0, 5.0, 6.0), (0, 7.0, 8.0, 9.0)]
> valid_data = [(1, 2.0, 3.0, 4.0), (0, 5.0, 6.0, 7.0), (1, 8.0, 9.0, 10.0)]
> schema = ["label", "feature1", "feature2", "feature3"]
> train = spark.createDataFrame(train_data, schema=schema)
> valid = spark.createDataFrame(valid_data, schema=schema)
> feature_columns = ["feature1", "feature2", "feature3"]
> assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
> train = assembler.transform(train)
> valid = assembler.transform(valid)
> svm = LinearSVC(maxIter=10, regParam=0.1)
> model = svm.fit(train)
> predictions = model.transform(valid)
> recallByLabel = MulticlassClassificationEvaluator(metricName="recallByLabel")
> weightedRecall = MulticlassClassificationEvaluator(metricName="weightedRecall")
> print(f"Recall by label: {recallByLabel.evaluate(predictions)}")
> print(f"Weighted recall: {weightedRecall.evaluate(predictions)}") {code}
> It produces:
> {code:java}
> Recall by label: 1.0
> Weighted recall: 0.3333333333333333{code}
> but predictions.show() implies the following hand-calculated confusion matrix:
> {code:java}
> -----------
> | 0 | 0 |
> | 2 | 1 |
> -----------{code}
> where the recall is 0, i.e., 0 / (0 + 2).
> What is the nature of this discrepancy? Note also that it is not restricted
> to recall, and that other classifiers, which include a probability column in
> predictions, behave similarly.
>
> Furthermore, the translation of the example to Scala, namely:
> {code:scala}
> import org.apache.spark.ml.classification.LinearSVC
> import org.apache.spark.ml.feature.VectorAssembler
> import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
> import org.apache.spark.sql.DataFrame
> val trainData = Seq((0, 1.0, 2.0, 3.0), (1, 4.0, 5.0, 6.0), (0, 7.0, 8.0, 9.0))
> val validData = Seq((1, 2.0, 3.0, 4.0), (0, 5.0, 6.0, 7.0), (1, 8.0, 9.0, 10.0))
> val schema = Seq("label", "feature1", "feature2", "feature3")
> val train: DataFrame = spark.createDataFrame(trainData).toDF(schema: _*)
> val valid: DataFrame = spark.createDataFrame(validData).toDF(schema: _*)
> val featureColumns = Array("feature1", "feature2", "feature3")
> val assembler = new VectorAssembler()
>   .setInputCols(featureColumns)
>   .setOutputCol("features")
> val trainAssembled = assembler.transform(train)
> val validAssembled = assembler.transform(valid)
> val svm = new LinearSVC()
>   .setMaxIter(10)
>   .setRegParam(0.1)
> val model = svm.fit(trainAssembled)
> val predictions = model.transform(validAssembled)
> val recallByLabel = new MulticlassClassificationEvaluator()
>   .setMetricName("recallByLabel")
> val weightedRecall = new MulticlassClassificationEvaluator()
>   .setMetricName("weightedRecall")
> println(s"Recall by label: ${recallByLabel.evaluate(predictions)}")
> println(s"Weighted recall: ${weightedRecall.evaluate(predictions)}"){code}
> produces the same recall by label and weighted recall, as described above.
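
The hand calculation in the report above can be checked without Spark. The
sketch below is plain Python and is not taken from the issue: the helper names
recall_by_label and weighted_recall are hypothetical, and the
(label, prediction) pairs assume that the model predicts class 0 for all three
validation rows, which the attached screenshot would need to confirm. Under
that assumption the recall is 1.0 for label 0 and 0.0 for label 1, so the
reported numbers would be consistent with the evaluator reporting label 0.
That would fit the evaluator's metricLabel parameter, which to my knowledge
defaults to 0.0, but this has not been verified against the reporter's
environment.

```python
from collections import Counter


def recall_by_label(pairs):
    """Return {label: recall} for an iterable of (true_label, predicted_label)."""
    true_counts = Counter(label for label, _ in pairs)
    hit_counts = Counter(label for label, pred in pairs if label == pred)
    # Recall for a label = correctly predicted rows / all rows with that true label.
    return {label: hit_counts[label] / n for label, n in true_counts.items()}


def weighted_recall(pairs):
    """Average the per-label recalls, weighted by each label's share of the rows."""
    total = len(pairs)
    true_counts = Counter(label for label, _ in pairs)
    recalls = recall_by_label(pairs)
    return sum(recalls[label] * true_counts[label] / total for label in true_counts)


# ASSUMPTION (not confirmed by the report): every validation row is predicted
# as class 0. The true labels 1, 0, 1 come from valid_data in the report.
pairs = [(1, 0), (0, 0), (1, 0)]

per_label = recall_by_label(pairs)
overall = weighted_recall(pairs)

print(f"Recall for label 0: {per_label[0]}")
print(f"Recall for label 1: {per_label[1]}")
print(f"Weighted recall: {overall}")
```

If the assumption holds, constructing the evaluator with
metricName="recallByLabel" and metricLabel=1.0 should return the 0.0 obtained
by hand, while the weighted recall stays at one third.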