srowen commented on a change in pull request #25789: [SPARK-28927][ML] Input
data to ALS can not be indeterminate
URL: https://github.com/apache/spark/pull/25789#discussion_r324431505
##########
File path: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
##########
@@ -920,6 +924,14 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
require(intermediateRDDStorageLevel != StorageLevel.NONE,
"ALS is not designed to run without persisting intermediate RDDs.")
+ // Indeterminate rating RDD causes inconsistent in/out blocks in case of rerun.
+ // It can cause runtime error when matching in/out user/item blocks.
+ if (ratings.outputDeterministicLevel == DeterministicLevel.INDETERMINATE) {
+ throw new IllegalArgumentException("The output of rating RDD can not be indeterminate. " +
+ "If your training data has indeterminate RDD computations, like `randomSplit` or `sample`" +
+ ", please checkpoint the training data before running ALS.")
Review comment:
I think this is going to break a lot of user code, although strictly speaking
you are quite correct. If I do a train/test split and even cache() the
results, which is the usual practice, it will fail now, right, because the
cached result is not considered deterministic? Does checkpointing change the
determinism level? What if you lose the checkpoint?
It's not a wild idea, but in theory this problem exists everywhere in Spark:
basically any test/train split has it. I hesitate to enforce this so strictly
everywhere. How about a warning?
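To make the two options concrete, here is a minimal standalone sketch of failing versus warning on an indeterminate input. It is only an illustration: `Level`, `Policy`, and `checkRatings` are hypothetical names that imitate the check in the diff, not Spark's `DeterministicLevel` or anything in `ALS.scala`, and the message text is abbreviated.

```scala
// Standalone sketch. These types only imitate the names used in the diff;
// they are NOT Spark classes and are assumptions made for illustration.
object DeterminismCheckSketch {
  // Hypothetical stand-in for the determinism levels referenced in the diff.
  sealed trait Level
  case object Determinate extends Level
  case object Unordered extends Level
  case object Indeterminate extends Level

  // Hypothetical policy switch: fail hard (as in the diff) or only warn.
  sealed trait Policy
  case object Fail extends Policy
  case object Warn extends Policy

  private val message =
    "The output of rating RDD can not be indeterminate. " +
      "Please checkpoint the training data before running ALS."

  /** Returns the warnings emitted; throws only under the Fail policy. */
  def checkRatings(level: Level, policy: Policy): Seq[String] =
    (level, policy) match {
      case (Indeterminate, Fail) => throw new IllegalArgumentException(message)
      case (Indeterminate, Warn) => Seq(message)
      case _                     => Seq.empty
    }
}
```

Under `Warn`, a pipeline that does a split followed by cache() would keep running and only surface the message, whereas `Fail` reproduces the behavior proposed in the diff.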