[
https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15392293#comment-15392293
]
Krishna Sankar edited comment on SPARK-14489 at 7/25/16 5:04 PM:
-----------------------------------------------------------------
From my experience in the field and with R, a couple of thoughts:
The ALS and the evaluator are doing the right thing given the information they
have and the absence of any contextual directives.
1. For the evaluator, as mentioned earlier, a flag similar to R's na.rm
(e.g. ignoreNaN=false, defaulting to the current behavior) would be a good
choice. I suspect we will need ignoreNaN elsewhere as well, for example in
the CrossValidator.
2. For ALS, in the absence of a directive we should not compute a default
average recommendation or even 0; the current NaN is the right answer.
Depending on the context, an application might decide not to recommend
anything, fall back to a default recommendation, or compute a dynamic value,
e.g. over a recent window. So a parameter defaultRecommendation="NaN" or
"average" or a value would be a good choice to cover all the possibilities.
Alternatively, the developer can use na.fill() for other operations.
Note: Saw the coldStartStrategy in Nick's patch. Will dig further.
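To make point 1 concrete, here is a minimal pure-Scala sketch of what an na.rm-style ignoreNaN flag could look like for RMSE. This is hypothetical illustration only, not the Spark RegressionEvaluator API; the function name and signature are invented:

```scala
// Hypothetical RMSE with an ignoreNaN flag, analogous to R's na.rm.
// ignoreNaN = false preserves the current NaN-propagating behavior.
def rmse(labels: Seq[Double], predictions: Seq[Double],
         ignoreNaN: Boolean = false): Double = {
  val pairs = labels.zip(predictions)
  // When ignoreNaN is set, drop any (label, prediction) pair containing NaN.
  val kept =
    if (ignoreNaN) pairs.filterNot { case (l, p) => l.isNaN || p.isNaN }
    else pairs
  val squaredErrors = kept.map { case (l, p) => (l - p) * (l - p) }
  // If every pair was NaN, 0.0 / 0 still yields NaN, which seems right.
  math.sqrt(squaredErrors.sum / squaredErrors.size)
}
```

With ignoreNaN = true, a cold-start NaN prediction is simply excluded from the metric instead of poisoning it.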
> RegressionEvaluator returns NaN for ALS in Spark ml
> ---------------------------------------------------
>
> Key: SPARK-14489
> URL: https://issues.apache.org/jira/browse/SPARK-14489
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 1.6.0
> Environment: AWS EMR
> Reporter: Boris Clémençon
> Labels: patch
> Original Estimate: 4h
> Remaining Estimate: 4h
>
> When building a Spark ML pipeline containing an ALS estimator, the metrics
> "rmse", "mse", "r2" and "mae" all return NaN.
> The reason is in CrossValidator.scala line 109. The K-folds are randomly
> generated. For large and sparse datasets, there is a significant probability
> that at least one user in the validation set is missing from the training
> set, which yields a few NaN estimations from the transform method, and hence
> NaN metrics from the RegressionEvaluator as well.
> Suggestion to fix the bug: drop the NaN values while computing the RMSE or
> the other metrics (i.e., remove users or items in the validation set that
> are missing from the training set), and log a warning when this happens.
> Issue SPARK-14153 seems to be the same problem.
> {code:title=Bar.scala|borderStyle=solid}
> val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0)
> splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
>   val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
>   val validationDataset = sqlCtx.createDataFrame(validation, schema).cache()
>   // multi-model training
>   logDebug(s"Train split $splitIndex with multiple sets of parameters.")
>   val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
>   trainingDataset.unpersist()
>   var i = 0
>   while (i < numModels) {
>     // TODO: duplicate evaluator to take extra params from input
>     val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)))
>     logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
>     metrics(i) += metric
>     i += 1
>   }
>   validationDataset.unpersist()
> }
> {code}
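The suggested fix can be sketched against the evaluation step of the loop above. This is a hypothetical fragment, not a committed patch: it reuses the models/eval/epm/validationDataset names from the quoted snippet and assumes Spark ML's default "prediction" output column, so it will not compile on its own.

```scala
// Inside the while loop above: drop rows whose prediction is NaN
// (users/items unseen during training) before computing the metric.
val predictions = models(i).transform(validationDataset, epm(i))
val cleaned = predictions.na.drop(Seq("prediction"))
val dropped = predictions.count() - cleaned.count()
if (dropped > 0) {
  // Surface the information loss instead of silently returning NaN metrics.
  logWarning(s"Dropped $dropped NaN predictions before evaluation " +
    s"for split $splitIndex.")
}
val metric = eval.evaluate(cleaned)
```

DataFrame.na.drop(cols) is the existing DataFrameNaFunctions API, so no new evaluator surface is needed for this workaround; the ignoreNaN flag discussed above would make it unnecessary.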
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)