[
https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240360#comment-15240360
]
Joseph K. Bradley commented on SPARK-14489:
-------------------------------------------
I'd like to try to separate a few issues here based on use cases and suggest the
"right thing to do" in each case:
* Deploying an ALSModel to make predictions: The model should make best-effort
predictions, even for new users. I'd say new users should get recommendations
based on the average user, for both the explicit and implicit settings.
Providing a Param which makes the model output NaN for unknown users seems
reasonable as an additional feature.
* Evaluating an ALSModel on a held-out dataset: This is the same as the first
case; the model should behave the same way it will when deployed.
* Model tuning using CrossValidator: I'm less sure about this. Both of your
suggestions seem reasonable (either returning NaN for missing users and
ignoring NaN in the evaluator, or making best-effort predictions for all
users). I also suspect it would be worthwhile to examine the literature to find
what tends to be best. E.g., should CrossValidator handle ranking specially by
doing stratified sampling to divide each user or item's ratings evenly across
folds of CV?
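
The deployment case above can be illustrated with a minimal, Spark-free sketch.
It assumes the model is just two maps of latent factor vectors. Every name here
(ColdStartSketch, averageFactors, predict, nanForUnknown) is hypothetical and
not part of the Spark API. Unknown users fall back to the element-wise mean of
the known user factors (the "average user") unless the flag requests NaN.
Unknown items simply yield NaN, since only new users are discussed here.

```scala
// Hypothetical sketch, not Spark code: best-effort prediction for unknown users.
object ColdStartSketch {
  type Factors = Array[Double]

  // Dot product of two latent factor vectors of equal length.
  def dot(a: Factors, b: Factors): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Element-wise mean of all known user vectors: the "average user".
  // Assumes `all` is non-empty.
  def averageFactors(all: Iterable[Factors]): Factors = {
    val n = all.size.toDouble
    all.reduce((a, b) => a.zip(b).map { case (x, y) => x + y }).map(_ / n)
  }

  def predict(
      userFactors: Map[Int, Factors],
      itemFactors: Map[Int, Factors],
      user: Int,
      item: Int,
      nanForUnknown: Boolean = false): Double =
    itemFactors.get(item) match {
      case None => Double.NaN // unknown item: out of scope for this sketch
      case Some(itemVec) =>
        userFactors.get(user) match {
          case Some(userVec)         => dot(userVec, itemVec)
          case None if nanForUnknown => Double.NaN
          case None                  => dot(averageFactors(userFactors.values), itemVec)
        }
    }
}
```

With nanForUnknown left at false, a held-out evaluation sees the same
predictions a deployed model would produce, which matches the second case.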
If we want the evaluator to be able to ignore NaNs, then I'd prefer we keep the
current behavior as the default and provide a Param which allows users to
ignore NaNs. I'd be afraid of a linear model not having enough regularization,
getting NaNs in its coefficients, having all of its predictions ignored by the
evaluator, etc.
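
As a rough sketch of that opt-in behavior, assuming plain
(prediction, label) pairs rather than a DataFrame: the object
NanAwareRmseSketch and the ignoreNaN flag are hypothetical names, not an
existing RegressionEvaluator Param. The default keeps the current behavior, so
a single NaN prediction makes the metric NaN. If every prediction is NaN, the
result stays NaN even with the flag set, so a broken model is not scored
silently.

```scala
// Hypothetical sketch, not Spark code: RMSE with an opt-in flag to skip NaN predictions.
object NanAwareRmseSketch {
  // pairs: (prediction, label). ignoreNaN defaults to false, i.e. current behavior.
  def rmse(pairs: Seq[(Double, Double)], ignoreNaN: Boolean = false): Double = {
    // Optionally drop rows whose prediction is NaN (e.g. unknown users).
    val used =
      if (ignoreNaN) pairs.filterNot { case (pred, _) => pred.isNaN }
      else pairs
    if (used.isEmpty) Double.NaN // nothing left to score: report NaN, not 0
    else {
      val squaredErrors = used.map { case (pred, label) => (pred - label) * (pred - label) }
      math.sqrt(squaredErrors.sum / used.size)
    }
  }
}
```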
What do you think?
> RegressionEvaluator returns NaN for ALS in Spark ml
> ---------------------------------------------------
>
> Key: SPARK-14489
> URL: https://issues.apache.org/jira/browse/SPARK-14489
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 1.6.0
> Environment: AWS EMR
> Reporter: Boris Clémençon
> Labels: patch
> Original Estimate: 4h
> Remaining Estimate: 4h
>
> When building a Spark ML pipeline containing an ALS estimator, the metrics
> "rmse", "mse", "r2" and "mae" all return NaN.
> The reason is in CrossValidator.scala line 109. The K-folds are randomly
> generated. For large and sparse datasets, there is a significant probability
> that at least one user of the validation set is missing from the training set,
> which produces a few NaN predictions from the transform method and therefore
> NaN metrics from RegressionEvaluator as well.
> Suggestion to fix the bug: remove the NaN values while computing the rmse or
> other metrics (i.e., remove users or items in the validation set that are
> missing from the training set). Log a message when this happens.
> Issue SPARK-14153 seems to be the same problem.
> {code:title=Bar.scala|borderStyle=solid}
> val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0)
> splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
>   val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
>   val validationDataset = sqlCtx.createDataFrame(validation, schema).cache()
>   // multi-model training
>   logDebug(s"Train split $splitIndex with multiple sets of parameters.")
>   val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
>   trainingDataset.unpersist()
>   var i = 0
>   while (i < numModels) {
>     // TODO: duplicate evaluator to take extra params from input
>     val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)))
>     logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
>     metrics(i) += metric
>     i += 1
>   }
>   validationDataset.unpersist()
> }
> {code}