agrawaldevesh commented on pull request #35005: URL: https://github.com/apache/spark/pull/35005#issuecomment-1003279326
> ```
> rdd.checkpoint()
> rdd.count
> ```
>
> Spark will run the job twice? This looks like an existing bug in spark core. I'm fine with this PR as a workaround at the SQL side.

Indeed. The first `rdd.checkpoint()` call merely marks the RDD for checkpointing. If we opened up the internal `RDD.doCheckpoint` call as a public API, users could call it to checkpoint eagerly. As things stand, `rdd.count` runs two Spark jobs: one for the actual count, after which `SparkContext.runJob` calls `RDD.doCheckpoint`, which in turn triggers `SparkContext.runJob` again to write the checkpoint output.

Do you think we should open up the `RDD.doCheckpoint` API? I am not sure about the ramifications of that, so I tried to do something scoped to the DataFrame/SQL side.
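To make the two-job behavior concrete, here is a minimal toy model of the lazy-mark / materialize-after-action pattern described above. This is plain Python, not Spark: `ToyRDD`, `_run_job`, and `jobs_run` are illustrative names standing in for the RDD, `SparkContext.runJob`, and a job counter.

```python
class ToyRDD:
    """Toy stand-in for an RDD, modeling lazy checkpoint marking."""

    def __init__(self, data):
        self.data = list(data)
        self.checkpoint_requested = False
        self.checkpointed = False
        self.jobs_run = 0  # counts "SparkContext.runJob"-style invocations

    def checkpoint(self):
        # Lazy: merely mark the RDD for checkpointing; no job runs here.
        self.checkpoint_requested = True

    def _run_job(self, func):
        # Stand-in for SparkContext.runJob.
        self.jobs_run += 1
        return func(self.data)

    def count(self):
        # Job 1: the action itself.
        n = self._run_job(len)
        # After the action, a pending checkpoint is materialized (job 2),
        # mirroring how runJob calls doCheckpoint after the user's job.
        if self.checkpoint_requested and not self.checkpointed:
            self._run_job(list)  # "write" the checkpoint output
            self.checkpointed = True
        return n


rdd = ToyRDD(range(5))
rdd.checkpoint()     # no job runs yet
print(rdd.jobs_run)  # 0
print(rdd.count())   # 5
print(rdd.jobs_run)  # 2 -- the count job plus the checkpoint-write job
```

Note that a second `count()` on the same `ToyRDD` would run only one additional job, since the checkpoint has already been written; the extra job is paid once, on the first action after marking.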
