viirya commented on a change in pull request #25789: [SPARK-28927][ML] Input 
data to ALS can not be indeterminate
URL: https://github.com/apache/spark/pull/25789#discussion_r324432557
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
 ##########
 @@ -920,6 +924,14 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
     require(intermediateRDDStorageLevel != StorageLevel.NONE,
       "ALS is not designed to run without persisting intermediate RDDs.")
 
+    // Indeterminate rating RDD causes inconsistent in/out blocks in case of rerun.
+    // It can cause runtime error when matching in/out user/item blocks.
+    if (ratings.outputDeterministicLevel == DeterministicLevel.INDETERMINATE) {
+      throw new IllegalArgumentException("The output of rating RDD can not be indeterminate. " +
+        "If your training data has indeterminate RDD computations, like `randomSplit` or `sample`" +
+        ", please checkpoint the training data before running ALS.")
 
 Review comment:
   This kind of failure only happens when you lose blocks of training data, 
either cached blocks or map outputs. It will not fail in every case, but it is 
more likely when fitting ALS on a large amount of data. This is hard to 
reproduce in a unit test or with a small dataset.
   
   Checkpointing makes the RDD output deterministic:
   
   
https://github.com/apache/spark/blob/c610de69526d65f3b679cfd81ab7e1a5791ff37f/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1919-L1925
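   As a rough illustration of the rule linked above, here is a minimal sketch in plain Scala. It is a simplified model, not Spark's code: `DeterminismSketch`, `Node`, `indeterminateOp` and `requireDeterminate` are hypothetical names, and Spark's real enum also has an `UNORDERED` level that is left out here. It only shows the idea that a reliably checkpointed node reports a determinate output even when something upstream is indeterminate.

```scala
// Simplified, hypothetical model of determinism-level propagation.
// None of these names are Spark classes; Spark's real enum also has UNORDERED.
object DeterminismSketch {
  sealed trait Level
  case object Determinate extends Level
  case object Indeterminate extends Level

  // Toy stand-in for an RDD node in a lineage graph.
  final case class Node(
      name: String,
      indeterminateOp: Boolean,
      parents: Seq[Node] = Seq.empty,
      reliablyCheckpointed: Boolean = false) {

    // A reliably checkpointed node reads saved data, so its output is fixed
    // regardless of how it was originally computed. Otherwise the node is
    // indeterminate if it, or any ancestor, is indeterminate.
    def outputLevel: Level =
      if (reliablyCheckpointed) Determinate
      else if (indeterminateOp || parents.exists(_.outputLevel == Indeterminate)) Indeterminate
      else Determinate
  }

  // Mirrors the shape of the guard in the diff above (assumed, simplified).
  def requireDeterminate(ratings: Node): Unit =
    if (ratings.outputLevel == Indeterminate) {
      throw new IllegalArgumentException(
        s"The output of ${ratings.name} can not be indeterminate.")
    }
}
```

   In this model, a `ratings` node with an indeterminate parent fails `requireDeterminate`, while `ratings.copy(reliablyCheckpointed = true)` passes it.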
   
   I think a checkpointed RDD can't be rerun like a cached or normal RDD, 
because checkpointing cleans up the RDD lineage. If you lose the checkpoint, 
the job should fail.
   
   Yeah, a warning sounds good. I also added notes to the documentation; 
hopefully this makes users aware of the issue.
   
   
   
   

