[
https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240718#comment-15240718
]
Nick Pentreath commented on SPARK-14489:
----------------------------------------
+1 for having CrossValidator be able to handle this in a more principled way by
doing stratified sampling by, say, one of the input columns (user id, for
example). This links to SPARK-8971 (which is focused on sampling by class
label, but which I think can be generalized to sampling by any input column).
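To make the idea concrete, here is a minimal non-Spark sketch of stratified fold assignment by a key column. The `stratifiedFolds` helper and its signature are hypothetical (no such API exists in Spark): each key's rows are spread round-robin across folds, so any user with at least numFolds ratings appears in every training split.

```scala
// Hypothetical sketch: assign a fold index to each row, stratified by a key
// (e.g. user id). Rows sharing a key are distributed round-robin over folds.
def stratifiedFolds[T, K](rows: Seq[T], key: T => K, numFolds: Int): Seq[Int] = {
  val foldOf = new Array[Int](rows.size)
  rows.zipWithIndex.groupBy { case (row, _) => key(row) }.values.foreach { group =>
    group.zipWithIndex.foreach { case ((_, rowIdx), i) =>
      foldOf(rowIdx) = i % numFolds
    }
  }
  foldOf.toSeq
}
```

With this assignment, fold k's training split contains every user that has more than one rating, avoiding the cold-start NaNs described below.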
Until we have something like this, allowing skipping NaNs in the evaluators is
perhaps the best option. If we agree, I can take a look at that - we could make
it an "expertParam" setting with an appropriate warning in the docs.
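For illustration, a minimal sketch of what "skip NaNs" could mean for a regression metric - this is a standalone hypothetical function, not the actual RegressionEvaluator code: rows whose prediction is NaN are dropped before computing RMSE.

```scala
// Hypothetical sketch of the proposed skip-NaN behaviour: compute RMSE over
// (prediction, label) pairs, dropping pairs whose prediction is NaN.
// A real implementation would also log a warning when rows are dropped.
def rmseSkippingNaN(pairs: Seq[(Double, Double)]): Double = {
  val valid = pairs.filterNot { case (prediction, _) => prediction.isNaN }
  val sumSquaredError = valid.map { case (p, l) => (p - l) * (p - l) }.sum
  math.sqrt(sumSquaredError / valid.size)
}
```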
I like the "average user" option in ALS a lot too. We can offer both options,
and provide some documentation about common use cases for each, as well as
expand the ALS examples to illustrate this.
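A rough sketch of the "average user" idea, again with hypothetical names rather than the actual ALS API: when a user was unseen at training time, predict with the element-wise mean of all known user factor vectors instead of returning NaN.

```scala
// Hypothetical "average user" fallback: dot the item's factor vector with the
// user's factors if known, otherwise with the mean of all user factor vectors.
def predictWithFallback(userFactors: Map[Int, Array[Double]],
                        itemFactor: Array[Double],
                        userId: Int): Double = {
  val factors = userFactors.getOrElse(userId, {
    val all = userFactors.values.toSeq
    Array.tabulate(itemFactor.length)(j => all.map(_(j)).sum / all.size)
  })
  factors.zip(itemFactor).map { case (u, v) => u * v }.sum
}
```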
Finally, is the case for #1 and #2 for a new item different from a new user? It
may make sense to recommend based on the average user for a new user in the
absence of any data, but does this make sense for a new item? I'm not sure; it
seems less natural to me.
> RegressionEvaluator returns NaN for ALS in Spark ml
> ---------------------------------------------------
>
> Key: SPARK-14489
> URL: https://issues.apache.org/jira/browse/SPARK-14489
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 1.6.0
> Environment: AWS EMR
> Reporter: Boris Clémençon
> Labels: patch
> Original Estimate: 4h
> Remaining Estimate: 4h
>
> When building a Spark ML pipeline containing an ALS estimator, the metrics
> "rmse", "mse", "r2" and "mae" all return NaN.
> The reason is in CrossValidator.scala line 109. The k-folds are randomly
> generated. For large, sparse datasets there is a significant probability that
> at least one user in the validation set is missing from the training set, so
> the transform method produces a few NaN predictions, which in turn make the
> RegressionEvaluator's metrics NaN as well.
> Suggested fix: remove the NaN values while computing the RMSE or other
> metrics (i.e., drop users or items in the validation set that are missing
> from the training set), and log a warning when this happens.
> Issue SPARK-14153 appears to be the same problem.
> {code:title=Bar.scala|borderStyle=solid}
> val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0)
> splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
>   val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
>   val validationDataset = sqlCtx.createDataFrame(validation, schema).cache()
>   // multi-model training
>   logDebug(s"Train split $splitIndex with multiple sets of parameters.")
>   val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
>   trainingDataset.unpersist()
>   var i = 0
>   while (i < numModels) {
>     // TODO: duplicate evaluator to take extra params from input
>     val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)))
>     logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
>     metrics(i) += metric
>     i += 1
>   }
>   validationDataset.unpersist()
> }
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)