[
https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15242251#comment-15242251
]
Seth Hendrickson commented on SPARK-14489:
------------------------------------------
[~mlnick] I am skeptical that
[SPARK-8971|https://issues.apache.org/jira/browse/SPARK-8971] applies here. To
guarantee that the per-user proportions are maintained in each sample, we would
need the Scalable Simple Random Sampling (ScaSRS) algorithm. From my
understanding, it will not work well for small strata like those you might
encounter in a recommendation setting. Say you need to guarantee that a user
with 100 ratings appears in each of 5 folds. ScaSRS only needs to sort the
items that land on a waitlist, whose expected size depends on the probability
of acceptance, the sample size, and the desired accuracy. For a loose accuracy
setting, I compute that the expected waitlist size in this scenario is about
20, i.e. the entire sample (100 ratings / 5 folds), so the method degrades to
naive sampling. More generally, I get the following:
(numRatings, expectedWaitListSize)
(100, 20.36)
(1000, 61.59)
(10000, 192.75)
(100000, 607.75)
(1000000, 1920.18)
I am using [this paper|http://jmlr.org/proceedings/papers/v28/meng13a.pdf] as a
reference. Perhaps [~mengxr] could clarify since he wrote the paper :D ?
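For concreteness, here is a sketch of how the expected waitlist sizes above can be computed from the acceptance/rejection thresholds in the paper (the same bounds Spark's exact binomial sampling uses): items with a random key below the lower bound are accepted outright, items between the bounds go to the waitlist. The δ = 0.1 setting is my assumption for the "loose accuracy" case:

```python
import math

def expected_waitlist_size(n, fraction, delta=0.1):
    """Expected ScaSRS waitlist size for sampling without replacement.

    n: stratum size, fraction: sampling fraction (e.g. 1/numFolds),
    delta: allowed failure probability (looser delta -> smaller waitlist).
    """
    gamma_u = -math.log(delta) / n
    gamma_l = gamma_u * 2.0 / 3.0
    # Lower/upper key thresholds from the concentration bounds in the paper.
    lower = fraction + gamma_l - math.sqrt(gamma_l ** 2 + 3 * gamma_l * fraction)
    upper = fraction + gamma_u + math.sqrt(gamma_u ** 2 + 2 * gamma_u * fraction)
    # Everything between the thresholds is waitlisted.
    return n * (min(1.0, upper) - max(0.0, lower))

for n in (100, 1000, 10000, 100000, 1000000):
    print(n, round(expected_waitlist_size(n, 0.2), 2))
```

Note the waitlist grows roughly as sqrt(n), so for a 100-rating user and 5 folds it is as large as the sample itself.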
> RegressionEvaluator returns NaN for ALS in Spark ml
> ---------------------------------------------------
>
> Key: SPARK-14489
> URL: https://issues.apache.org/jira/browse/SPARK-14489
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 1.6.0
> Environment: AWS EMR
> Reporter: Boris Clémençon
> Labels: patch
> Original Estimate: 4h
> Remaining Estimate: 4h
>
> When building a Spark ML pipeline containing an ALS estimator, the metrics
> "rmse", "mse", "r2" and "mae" all return NaN.
> The reason is in CrossValidator.scala line 109. The K folds are randomly
> generated. For large, sparse datasets, there is a significant probability
> that at least one user in the validation set is missing from the training
> set, which yields NaN estimates from the transform method and hence NaN
> RegressionEvaluator metrics as well.
> Suggested fix: remove the NaN values while computing the rmse or other
> metrics (i.e., drop users or items in the validation set that are missing
> from the training set), and log a warning when this happens.
> Issue SPARK-14153 appears to be the same problem.
> {code:title=Bar.scala|borderStyle=solid}
> val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0)
> splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
>   val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
>   val validationDataset = sqlCtx.createDataFrame(validation, schema).cache()
>   // multi-model training
>   logDebug(s"Train split $splitIndex with multiple sets of parameters.")
>   val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
>   trainingDataset.unpersist()
>   var i = 0
>   while (i < numModels) {
>     // TODO: duplicate evaluator to take extra params from input
>     val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)))
>     logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
>     metrics(i) += metric
>     i += 1
>   }
>   validationDataset.unpersist()
> }
> {code}
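The suggested fix amounts to skipping cold-start predictions before computing the metric. A minimal sketch of the idea in plain Python (the function name and data layout are hypothetical, not the Spark API):

```python
import math

def rmse_ignoring_nan(pairs):
    """RMSE over (prediction, label) pairs, skipping NaN predictions.

    NaN predictions arise when ALS is asked to score a user or item
    that never appeared in the training fold (cold start).
    """
    valid = [(p, y) for p, y in pairs if not math.isnan(p)]
    if not valid:
        return float("nan")  # nothing left to evaluate
    return math.sqrt(sum((p - y) ** 2 for p, y in valid) / len(valid))

# Two cold-start users yield NaN predictions; they are excluded.
pairs = [(3.0, 2.5), (float("nan"), 4.0), (1.0, 1.0), (float("nan"), 5.0)]
print(rmse_ignoring_nan(pairs))  # ≈ 0.3536, RMSE over the two valid pairs
```

(Spark later addressed this in the ALS estimator itself via the coldStartStrategy="drop" parameter.)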
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]