[ 
https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240360#comment-15240360
 ] 

Joseph K. Bradley commented on SPARK-14489:
-------------------------------------------

I'd like to separate a few issues here based on use cases and suggest the 
"right thing to do" in each case:
* Deploying an ALSModel to make predictions: The model should make best-effort 
predictions, even for new users.  I'd say new users should get recommendations 
based on the average user, for both the explicit and implicit settings.  
Providing a Param which makes the model output NaN for unknown users seems 
reasonable as an additional feature.
* Evaluating an ALSModel on a held-out dataset: This is the same as the first 
case; the model should behave the same way it will when deployed.
* Model tuning using CrossValidator: I'm less sure about this.  Both of your 
suggestions seem reasonable (either returning NaN for missing users and 
ignoring NaNs in the evaluator, or making best-effort predictions for all 
users).  It would also be worthwhile to examine the literature for what tends 
to work best.  E.g., should CrossValidator handle ranking specially by doing 
stratified sampling to divide each user's or item's ratings evenly across the 
CV folds?
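To make the deployment option concrete, here is a plain-Scala sketch (not the Spark API; the predictor and all names are illustrative) of the two behaviors: an "average user" best-effort fallback versus an opt-in Param that yields NaN for unknown users:

```scala
// Illustrative sketch only: a toy factor-model predictor, not Spark's ALSModel.
// An unknown user either gets NaN (opt-in flag) or a best-effort prediction
// computed from the average of all learned user factor vectors.
def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum

def predict(
    userFactors: Map[Int, Array[Double]],
    itemFactors: Map[Int, Array[Double]],
    user: Int,
    item: Int,
    nanForUnknown: Boolean): Double = {
  val itemVec = itemFactors(item)
  userFactors.get(user) match {
    case Some(userVec)         => dot(userVec, itemVec)
    case None if nanForUnknown => Double.NaN
    case None =>
      // best-effort fallback: score against the "average user" factor vector
      val n = userFactors.size.toDouble
      val avg = Array.tabulate(itemVec.length) { i =>
        userFactors.valuesIterator.map(_(i)).sum / n
      }
      dot(avg, itemVec)
  }
}
```

The same fallback would apply to unknown items; the NaN path is what the Param in the first bullet would enable.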

If we want the evaluator to be able to ignore NaNs, then I'd prefer we keep the 
current behavior as the default and provide a Param which allows users to 
ignore NaNs.  I'd be afraid of a linear model not having enough 
regularization, getting NaNs in its coefficients, and then having all of its 
predictions silently ignored by the evaluator.
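A minimal sketch of that trade-off (plain Scala, not the RegressionEvaluator API): an RMSE that can either propagate NaNs, as today, or drop NaN predictions. The degenerate case is exactly the worry above: a model that predicts NaN everywhere would be scored on zero examples.

```scala
// Illustrative only: RMSE over (label, prediction) pairs with an opt-in
// "ignore NaN predictions" mode. If every prediction is NaN, nothing is
// left to score and the metric is still NaN rather than a silent success.
def rmse(pairs: Seq[(Double, Double)], ignoreNaN: Boolean): Double = {
  val kept =
    if (ignoreNaN) pairs.filterNot { case (_, p) => p.isNaN } else pairs
  if (kept.isEmpty) Double.NaN
  else math.sqrt(kept.map { case (l, p) => (l - p) * (l - p) }.sum / kept.size)
}
```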

What do you think?

> RegressionEvaluator returns NaN for ALS in Spark ml
> ---------------------------------------------------
>
>                 Key: SPARK-14489
>                 URL: https://issues.apache.org/jira/browse/SPARK-14489
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 1.6.0
>         Environment: AWS EMR
>            Reporter: Boris Clémençon 
>              Labels: patch
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> When building a Spark ML pipeline containing an ALS estimator, the metrics 
> "rmse", "mse", "r2" and "mae" all return NaN. 
> The reason is in CrossValidator.scala, line 109. The K folds are randomly 
> generated; for large, sparse datasets there is a significant probability 
> that at least one user in the validation set is missing from the training 
> set, hence producing NaN estimates from the transform method and NaN 
> RegressionEvaluator metrics as well. 
> Suggested fix: drop the NaN values while computing the RMSE and other 
> metrics (i.e., ignore users or items in the validation set that are missing 
> from the training set), and log a warning when this happens.
> Issue SPARK-14153 seems to be the same problem.
> {code:title=CrossValidator.scala|borderStyle=solid}
>     val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0)
>     splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
>       val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
>       val validationDataset = sqlCtx.createDataFrame(validation, 
> schema).cache()
>       // multi-model training
>       logDebug(s"Train split $splitIndex with multiple sets of parameters.")
>       val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
>       trainingDataset.unpersist()
>       var i = 0
>       while (i < numModels) {
>         // TODO: duplicate evaluator to take extra params from input
>         val metric = eval.evaluate(models(i).transform(validationDataset, 
> epm(i)))
>         logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
>         metrics(i) += metric
>         i += 1
>       }
>       validationDataset.unpersist()
>     }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
