Dong Wang created SPARK-29812:

             Summary: Missing persist on predictionAndLabels in 
                 Key: SPARK-29812
             Project: Spark
          Issue Type: Improvement
          Components: ML
    Affects Versions: 2.4.3
            Reporter: Dong Wang

The RDD predictionAndLabels in 
ml.evaluation.MulticlassClassificationEvaluator.evaluate() needs to be 
persisted. When MulticlassMetrics uses predictionAndLabels to initialize its 
fields, at least five actions are executed on predictionAndLabels, so without 
persist() the RDD is recomputed from the dataset for each of them.

  override def evaluate(dataset: Dataset[_]): Double = {
    val schema = dataset.schema
    SchemaUtils.checkColumnType(schema, $(predictionCol), DoubleType)
    SchemaUtils.checkNumericType(schema, $(labelCol))
    // Needs to be persisted
    val predictionAndLabels =
      dataset.select(col($(predictionCol)), col($(labelCol)).cast(DoubleType)).rdd.map {
        case Row(prediction: Double, label: Double) => (prediction, label)
      }
    // The initialization uses predictionAndLabels multiple times, in different actions
    val metrics = new MulticlassMetrics(predictionAndLabels)
    val metric = $(metricName) match {
      case "f1" => metrics.weightedFMeasure
      case "weightedPrecision" => metrics.weightedPrecision
      case "weightedRecall" => metrics.weightedRecall
      case "accuracy" => metrics.accuracy
    }
    metric
  }
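
One way the suggested change could look is sketched below. This is only an 
illustration, not the patch applied to Spark: the object PersistedMetrics and 
the helper evaluatePersisted are hypothetical names, and the choice of 
StorageLevel.MEMORY_AND_DISK is an assumption. The sketch persists the RDD 
before MulticlassMetrics reads it and releases it afterwards, leaving alone 
any storage level the caller has already set.

```scala
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical helper, not part of Spark.
object PersistedMetrics {

  def evaluatePersisted(
      predictionAndLabels: RDD[(Double, Double)],
      metricName: String): Double = {
    // Persist only if the caller has not already chosen a storage level.
    // MEMORY_AND_DISK is an assumed choice for this sketch.
    val needsPersist = predictionAndLabels.getStorageLevel == StorageLevel.NONE
    if (needsPersist) predictionAndLabels.persist(StorageLevel.MEMORY_AND_DISK)
    try {
      val metrics = new MulticlassMetrics(predictionAndLabels)
      // The metric is computed here, inside the try block, so every action
      // it triggers runs while the RDD is still persisted.
      metricName match {
        case "f1" => metrics.weightedFMeasure
        case "weightedPrecision" => metrics.weightedPrecision
        case "weightedRecall" => metrics.weightedRecall
        case "accuracy" => metrics.accuracy
        case other =>
          throw new IllegalArgumentException(s"Unsupported metric: $other")
      }
    } finally {
      // Release the cached blocks only if this helper persisted them.
      if (needsPersist) predictionAndLabels.unpersist()
    }
  }
}
```

The try/finally keeps the unpersist() paired with the persist() even when a 
metric computation fails, which is the kind of pairing the report is about.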

This issue was reported by our tool CacheCheck, which dynamically detects 
misuses of the persist()/unpersist() API.

This message was sent by Atlassian Jira
