viirya commented on a change in pull request #25789: [SPARK-28927][ML] Input 
data to ALS can not be indeterminate
URL: https://github.com/apache/spark/pull/25789#discussion_r324412870
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
 ##########
 @@ -920,6 +924,14 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
     require(intermediateRDDStorageLevel != StorageLevel.NONE,
       "ALS is not designed to run without persisting intermediate RDDs.")
 
+    // Indeterminate rating RDD causes inconsistent in/out blocks in case of rerun.
+    // It can cause runtime error when matching in/out user/item blocks.
+    if (ratings.outputDeterministicLevel == DeterministicLevel.INDETERMINATE) {
 
 Review comment:
   It is because of how ALS is implemented.
   
   ALS uses the training data RDD to build the user/item in/out blocks, like:
   
   training RDD -> user in/out block
   training RDD -> item in/out block
   
   Later, it matches the user in block with the item out block, and the item in block with the user out block.
   
   If the training RDD is indeterminate, any rerun of its tasks can produce different output in these user/item blocks. If such a rerun happens in any iteration, the blocks no longer match, and a user/item index can't find its corresponding slot in the user/item factors.
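
To illustrate the mismatch described above, here is a minimal sketch in plain Scala collections, with no Spark dependency. All names (`Rating`, `itemSlots`, `missingSlots`, `run1`, `run2`) are hypothetical and are not ALS internals; the two sequences simply stand in for two evaluations of an indeterminate training RDD that return different data.

```scala
object IndeterminateBlocksSketch {
  // (userId, itemId, rating). Hypothetical stand-in for a training record.
  type Rating = (Int, Int, Float)

  // Two evaluations of the "same" indeterminate input, e.g. the original run
  // and a rerun of some tasks after a failure. The contents differ.
  val run1: Seq[Rating] = Seq((1, 10, 5f), (1, 20, 3f), (2, 30, 4f))
  val run2: Seq[Rating] = Seq((1, 10, 5f), (2, 20, 1f), (2, 40, 2f))

  // Item side: give every distinct item id a slot in the item factor array.
  def itemSlots(ratings: Seq[Rating]): Map[Int, Int] =
    ratings.map(_._2).distinct.sorted.zipWithIndex.toMap

  // User side: the item ids whose factors the user blocks will ask for.
  def itemsReferencedByUsers(ratings: Seq[Rating]): Set[Int] =
    ratings.map(_._2).toSet

  // Item ids the user side references but the item side has no slot for.
  def missingSlots(userSide: Seq[Rating], itemSide: Seq[Rating]): Set[Int] =
    itemsReferencedByUsers(userSide) -- itemSlots(itemSide).keySet
}
```

When both sides are built from the same evaluation, `missingSlots(run1, run1)` is empty. When the item side is rebuilt from a different evaluation, `missingSlots(run1, run2)` contains item 30, which is the kind of lookup that has no corresponding slot.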
   
   
   
   
   
   
