Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11919#discussion_r60511187
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -656,7 +656,8 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
             itemFactors.setName(s"itemFactors-$iter").persist(intermediateRDDStorageLevel)
             // TODO: Generalize PeriodicGraphCheckpointer and use it here.
             if (shouldCheckpoint(iter)) {
    -          itemFactors.checkpoint() // itemFactors gets materialized in computeFactors.
    +          // itemFactors gets materialized in computeFactors & here.
    +          ALS.checkpointAndCleanParents(itemFactors)
    --- End diff ---
    
    @MLnick Actually, that plan doesn't work - checkpointing discards all of the parent information we need to clean up the shuffle files. I could refactor this so that we capture the needed dependency information before checkpointing, but a count() on a cached RDD should be cheap enough that I'm not sure it would be worth it. What are your thoughts?
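    
    For reference, here is a rough sketch of the shape I have in mind for checkpointAndCleanParents. The body below is illustrative only, not the exact code in this PR - the object name CheckpointSketch is made up, and it assumes the helper lives inside the org.apache.spark package tree, since SparkContext.cleaner and ContextCleaner are private[spark]:
    
        package org.apache.spark.ml.recommendation
    
        import org.apache.spark.ShuffleDependency
        import org.apache.spark.rdd.RDD
    
        // Illustrative helper only; in the PR this would be a private method on the ALS object.
        private[recommendation] object CheckpointSketch {
          /** Checkpoint `rdd`, capturing its shuffle dependencies first so the
            * shuffle files can still be cleaned up after the lineage is truncated. */
          def checkpointAndCleanParents[T](rdd: RDD[T]): Unit = {
            // Grab the shuffle dependencies before checkpoint() drops the parent info.
            val shuffleDeps = rdd.dependencies.collect {
              case dep: ShuffleDependency[_, _, _] => dep
            }
            rdd.checkpoint()
            // rdd is already persisted, so count() is cheap and materializes the checkpoint.
            rdd.count()
            // Ask the ContextCleaner (if enabled) to remove the now-stale shuffle files.
            shuffleDeps.foreach { dep =>
              rdd.sparkContext.cleaner.foreach(_.doCleanupShuffle(dep.shuffleId, blocking = false))
            }
          }
        }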

