[ https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240360#comment-15240360 ]
Joseph K. Bradley commented on SPARK-14489:
-------------------------------------------

I'd like to try to separate a few issues here based on use cases and suggest the "right thing to do" in each case:
* Deploying an ALSModel to make predictions: The model should make best-effort predictions, even for new users. I'd say new users should get recommendations based on the average user, for both the explicit and implicit settings. Providing a Param which makes the model output NaN for unknown users seems reasonable as an additional feature.
* Evaluating an ALSModel on a held-out dataset: This is the same as the first case; the model should behave the same way it will when deployed.
* Model tuning using CrossValidator: I'm less sure about this. Both of your suggestions seem reasonable (either returning NaN for missing users and ignoring NaN in the evaluator, or making best-effort predictions for all users). I also suspect it would be worthwhile to examine the literature to find what tends to work best. E.g., should CrossValidator handle ranking specially by doing stratified sampling to divide each user's or item's ratings evenly across the folds of CV?

If we want the evaluator to be able to ignore NaNs, then I'd prefer we keep the current behavior as the default and provide a Param which allows users to ignore NaNs. I'd be afraid of linear models not having enough regularization, getting NaNs in their coefficients, having all of their predictions ignored by the evaluator, etc.

What do you think?
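To make the opt-in evaluator behavior above concrete, here is a minimal sketch in plain Scala with no Spark dependency. The object `NaNAwareRmse`, the function `rmse`, and the flag `dropNaN` are hypothetical names for illustration only; they are not part of RegressionEvaluator or any existing Spark API. The default keeps the current behavior (a single NaN prediction makes the metric NaN), and the flag skips NaN predictions only when explicitly requested.

```scala
// Hypothetical helper for illustration; not Spark's RegressionEvaluator.
object NaNAwareRmse {

  /** RMSE over (prediction, label) pairs.
    *
    * dropNaN = false (default): a NaN prediction propagates, so the result is NaN.
    * dropNaN = true: pairs whose prediction is NaN are skipped before averaging.
    * If no pairs remain, the result is NaN rather than a misleading 0.0.
    */
  def rmse(pairs: Seq[(Double, Double)], dropNaN: Boolean = false): Double = {
    val kept =
      if (dropNaN) pairs.filterNot { case (prediction, _) => prediction.isNaN }
      else pairs
    if (kept.isEmpty) Double.NaN
    else {
      val sumSquaredError = kept.map { case (prediction, label) =>
        val error = prediction - label
        error * error
      }.sum
      math.sqrt(sumSquaredError / kept.size)
    }
  }

  def main(args: Array[String]): Unit = {
    // The second pair stands in for a user that was missing from the training fold.
    val pairs = Seq((3.0, 4.0), (Double.NaN, 2.0), (5.0, 5.0))
    println(rmse(pairs))                 // default: NaN propagates
    println(rmse(pairs, dropNaN = true)) // NaN pair skipped: sqrt(0.5), about 0.707
  }
}
```

Note that with `dropNaN = true` a model whose predictions are all NaN still yields NaN in this sketch, since nothing is left to average. That is one way to avoid silently scoring a broken model, which is the risk raised above for under-regularized linear models.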

> RegressionEvaluator returns NaN for ALS in Spark ml
> ---------------------------------------------------
>
>                 Key: SPARK-14489
>                 URL: https://issues.apache.org/jira/browse/SPARK-14489
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 1.6.0
>         Environment: AWS EMR
>            Reporter: Boris Clémençon
>              Labels: patch
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> When building a Spark ML pipeline containing an ALS estimator, the metrics
> "rmse", "mse", "r2" and "mae" all return NaN.
> The reason is in CrossValidator.scala line 109. The K-folds are randomly
> generated. For large and sparse datasets, there is a significant probability
> that at least one user of the validation set is missing in the training set,
> hence generating a few NaN estimation with transform method and NaN
> RegressionEvaluator's metrics too.
> Suggestion to fix the bug: remove the NaN values while computing the rmse or
> other metrics (ie, removing users or items in validation test that is missing
> in the learning set). Send logs when this happen.
> Issue SPARK-14153 seems to be the same pbm
> {code:title=Bar.scala|borderStyle=solid}
>     val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0)
>     splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
>       val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
>       val validationDataset = sqlCtx.createDataFrame(validation, schema).cache()
>       // multi-model training
>       logDebug(s"Train split $splitIndex with multiple sets of parameters.")
>       val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
>       trainingDataset.unpersist()
>       var i = 0
>       while (i < numModels) {
>         // TODO: duplicate evaluator to take extra params from input
>         val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)))
>         logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
>         metrics(i) += metric
>         i += 1
>       }
>       validationDataset.unpersist()
>     }
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)