[
https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15232357#comment-15232357
]
Sean Owen commented on SPARK-14489:
-----------------------------------
NaN, to me, means the result was undefined or uncomputable. However, for
recommenders there's nothing too strange about being asked for a recommendation
for a new user. For some methods there's a clear answer: a new user with no
data gets 0 recommendations; 0 is the meaningful default for the implicit case.
Some kind of global mean is better than nothing for the explicit case. It
doesn't bias the metrics, as an answer is an answer; some are better than
others, but that's what we're measuring.
As I say, the problem with ignoring NaN is that you don't consider these cases,
but they're legitimate cases where the recommender wasn't able to produce a
result, and that should be reflected as "bad".
Still, as a stop-gap, assuming NaN is rare, ignoring NaN in the evaluator is
strictly an improvement since it means you can return some meaningful answer
instead of none. Later, if the ALS implementation never returns NaN, then this
behavior in the evaluator doesn't matter anyway. So I'd support that change as
a local improvement.
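The two options discussed above — ignoring NaN predictions as a stop-gap, or
substituting a global-mean default in the explicit case — can be sketched
outside Spark. A minimal Python sketch (the `evaluate` helper and the toy
labels/predictions are illustrative, not Spark API):

```python
import math

def rmse(pairs):
    """Root mean squared error over (label, prediction) pairs."""
    if not pairs:
        return float("nan")
    return math.sqrt(sum((l - p) ** 2 for l, p in pairs) / len(pairs))

def evaluate(labels, preds, strategy="drop"):
    """Compute RMSE while handling NaN predictions for unseen users/items.

    strategy="drop": ignore NaN predictions (the stop-gap discussed above).
    strategy="mean": substitute the global mean of the labels (explicit case).
    """
    if strategy == "drop":
        pairs = [(l, p) for l, p in zip(labels, preds) if not math.isnan(p)]
    elif strategy == "mean":
        mean = sum(labels) / len(labels)
        pairs = [(l, mean if math.isnan(p) else p)
                 for l, p in zip(labels, preds)]
    else:
        raise ValueError(strategy)
    return rmse(pairs)

labels = [4.0, 3.0, 5.0, 1.0]
preds = [3.5, float("nan"), 4.5, 1.5]  # NaN: user absent from training fold
print(evaluate(labels, preds, "drop"))  # NaN row excluded from the average
print(evaluate(labels, preds, "mean"))  # NaN replaced by the mean of labels
```

Note that "drop" changes the effective size of the validation set, while
"mean" still charges the model an error for the rows it could not score,
which is the distinction made above.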
> RegressionEvaluator returns NaN for ALS in Spark ml
> ---------------------------------------------------
>
> Key: SPARK-14489
> URL: https://issues.apache.org/jira/browse/SPARK-14489
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 1.6.0
> Environment: AWS EMR
> Reporter: Boris Clémençon
> Labels: patch
> Original Estimate: 4h
> Remaining Estimate: 4h
>
> When building a Spark ML pipeline containing an ALS estimator, the metrics
> "rmse", "mse", "r2" and "mae" all return NaN.
> The reason is in CrossValidator.scala line 109. The K-folds are randomly
> generated. For large and sparse datasets, there is a significant probability
> that at least one user in the validation set is missing from the training
> set, hence producing a few NaN estimates from the transform method, and NaN
> RegressionEvaluator metrics as well.
> Suggestion to fix the bug: remove the NaN values while computing the rmse or
> other metrics (i.e., drop users or items in the validation set that are
> missing from the training set), and log a warning when this happens.
> Issue SPARK-14153 seems to be the same problem.
> {code:title=CrossValidator.scala|borderStyle=solid}
> val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0)
> splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
>   val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
>   val validationDataset = sqlCtx.createDataFrame(validation, schema).cache()
>   // multi-model training
>   logDebug(s"Train split $splitIndex with multiple sets of parameters.")
>   val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
>   trainingDataset.unpersist()
>   var i = 0
>   while (i < numModels) {
>     // TODO: duplicate evaluator to take extra params from input
>     val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)))
>     logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
>     metrics(i) += metric
>     i += 1
>   }
>   validationDataset.unpersist()
> }
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)