agrawaldevesh commented on pull request #35005:
URL: https://github.com/apache/spark/pull/35005#issuecomment-1003279326


   
   > ```
   > rdd.checkpoint()
   > rdd.count
   > ```
   > 
   > Spark will run the job twice? This looks like an existing bug in Spark core. I'm fine with this PR as a workaround at the SQL side.
   
   Indeed. The first `rdd.checkpoint()` call merely marks the RDD for checkpointing. If we exposed the internal `RDD.doCheckpoint` call as a public API, users could invoke it to checkpoint eagerly.
   
   But otherwise, `rdd.count` will run a Spark job twice: once for the actual count, and then `SparkContext.runJob` will call `RDD.doCheckpoint`, which triggers `SparkContext.runJob` again to write the checkpoint output.
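   A minimal sketch of the two-job behavior described above, assuming a local Spark context (the app name and checkpoint directory are illustrative):
   
   ```scala
   import org.apache.spark.{SparkConf, SparkContext}
   
   val sc = new SparkContext(
     new SparkConf().setAppName("checkpoint-demo").setMaster("local[2]"))
   sc.setCheckpointDir("/tmp/ckpt") // illustrative path
   
   val rdd = sc.parallelize(1 to 100).map(_ * 2)
   rdd.checkpoint() // lazy: only marks the RDD for checkpointing
   rdd.count()      // job 1: computes the count; afterwards SparkContext.runJob
                    // calls RDD.doCheckpoint, which launches job 2 to write the
                    // checkpoint files (recomputing the RDD in the process)
   ```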
   
   Do you think we should open up the `RDD.doCheckpoint` API? I am not sure about the ramifications of that, so I tried to do something scoped to the DataFrame/SQL side.
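   For comparison, a sketch of the eager variant that already exists on the Dataset side: `Dataset.checkpoint(eager = true)` materializes the checkpoint immediately rather than marking it lazily (the session setup and checkpoint directory here are illustrative):
   
   ```scala
   import org.apache.spark.sql.SparkSession
   
   val spark = SparkSession.builder()
     .appName("eager-checkpoint-demo")
     .master("local[2]")
     .getOrCreate()
   spark.sparkContext.setCheckpointDir("/tmp/ckpt") // illustrative path
   
   val df = spark.range(100).toDF("id")
   val ckpt = df.checkpoint(eager = true) // runs the job and writes the checkpoint now
   ```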


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


