[ 
https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15395275#comment-15395275
 ] 

Nick Pentreath edited comment on SPARK-14489 at 7/27/16 9:03 AM:
-----------------------------------------------------------------

Thanks for the thoughts, Krishna.

# Initially I also thought a flag to ignore NaN in the evaluators would make 
sense. However, I have never seen (and can't think of) a situation where this 
is desirable _outside_ of this one, where splitting the dataset can produce 
user/item ids the model has not been trained on (this applies to "ranking" 
cases in general). In all other typical supervised learning cases, NaN means 
either (a) NaN inputs, which the user should handle in the pipeline before 
training, or (b) a model with bad coefficients. In both of these cases, I'd 
argue it is correct to return NaN, and not desirable to ignore it;
# Hence the approach is rather to fix the issue in ALS itself with the 
{{coldStartStrategy}} param. In future it can also support other, more 
elaborate strategies for batch prediction modes. That said, ALS in Spark is 
really geared towards training, since batch prediction is usually a "top-k" 
recommendation scenario, where brute-force scoring is typically not the 
approach you want to be using.

Please do comment on the linked PR for {{coldStartStrategy}}.
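For illustration, the intended "drop" semantics can be sketched outside Spark: rows whose prediction is NaN (i.e. user/item ids unseen during training) are filtered out before any evaluator computes a metric. The {{Prediction}} case class and the data here are hypothetical, not Spark API:

```scala
// Hypothetical rows shaped like ALS transform output: a NaN prediction
// means the user or item id was absent from the training set.
case class Prediction(user: Int, item: Int, rating: Double, prediction: Double)

// Sketch of coldStartStrategy = "drop": remove NaN predictions so a
// downstream evaluator only sees scorable rows.
def dropColdStart(rows: Seq[Prediction]): Seq[Prediction] =
  rows.filterNot(_.prediction.isNaN)

val rows = Seq(
  Prediction(1, 10, 4.0, 3.8),
  Prediction(2, 11, 2.0, Double.NaN), // user 2 unseen in training
  Prediction(3, 12, 5.0, 4.9)
)

val scorable = dropColdStart(rows) // only the two rows with defined predictions
```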


> RegressionEvaluator returns NaN for ALS in Spark ml
> ---------------------------------------------------
>
>                 Key: SPARK-14489
>                 URL: https://issues.apache.org/jira/browse/SPARK-14489
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 1.6.0
>         Environment: AWS EMR
>            Reporter: Boris Clémençon 
>              Labels: patch
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> When building a Spark ML pipeline containing an ALS estimator, the metrics 
> "rmse", "mse", "r2" and "mae" all return NaN. 
> The reason is in CrossValidator.scala line 109. The K-folds are randomly 
> generated. For large and sparse datasets, there is a significant probability 
> that at least one user in the validation set is missing from the training 
> set, hence the transform method generates some NaN estimates and the 
> RegressionEvaluator's metrics become NaN too. 
> Suggestion to fix the bug: remove the NaN values while computing the rmse or 
> other metrics (i.e., remove users or items in the validation set that are 
> missing from the training set), and log a warning when this happens.
> Issue SPARK-14153 seems to be the same problem.
> {code:title=Bar.scala|borderStyle=solid}
>     val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0)
>     splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
>       val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
>       val validationDataset = sqlCtx.createDataFrame(validation, schema).cache()
>       // multi-model training
>       logDebug(s"Train split $splitIndex with multiple sets of parameters.")
>       val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
>       trainingDataset.unpersist()
>       var i = 0
>       while (i < numModels) {
>         // TODO: duplicate evaluator to take extra params from input
>         val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)))
>         logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
>         metrics(i) += metric
>         i += 1
>       }
>       validationDataset.unpersist()
>     }
> {code}
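The fix the reporter suggests, dropping NaN pairs before computing the metric, can be sketched in plain Scala (no Spark dependency; the function names here are illustrative, not the actual RegressionEvaluator code):

```scala
// Naive RMSE over (prediction, label) pairs: a single NaN prediction
// poisons the whole metric.
def rmse(pairs: Seq[(Double, Double)]): Double = {
  val se = pairs.map { case (p, l) => (p - l) * (p - l) }
  math.sqrt(se.sum / se.length)
}

// Suggested behaviour: drop NaN pairs first and log how many were dropped,
// so unseen user/item ids don't turn the metric into NaN.
def rmseDroppingNaN(pairs: Seq[(Double, Double)]): Double = {
  val clean = pairs.filterNot { case (p, l) => p.isNaN || l.isNaN }
  if (clean.length < pairs.length)
    Console.err.println(s"Dropped ${pairs.length - clean.length} NaN pair(s)")
  rmse(clean)
}

val pairs = Seq((3.8, 4.0), (Double.NaN, 2.0), (4.9, 5.0))
val naive = rmse(pairs)            // NaN: poisoned by the unseen user
val fixed = rmseDroppingNaN(pairs) // finite RMSE over the two clean pairs
```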



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
