agrawaldevesh commented on pull request #35005: URL: https://github.com/apache/spark/pull/35005#issuecomment-1004372726
Hi @mridulm, thanks for pointing out that checkpointing does not necessarily rerun the whole job. In practice, however, we have noticed that it often does end up re-running the job, for the following reasons:

- Many user jobs don't have a shuffle, for example simple ETL or ML training jobs.
- Users sometimes don't cache/persist the RDD prior to calling checkpoint. They skip it sometimes out of oversight, and sometimes because they notice that Dynamic Executor Allocation does not play well with `.cache`/`.persist`. In this case, as you note, the whole non-shuffle lineage would be rerun.

We have noticed that while caching/persisting can be faster than checkpointing to HDFS, it is a bit less robust in our environment, so we see users going for the latter approach. We tried this PR in a couple of internal ML training jobs that make heavy use of checkpointing, and it yielded roughly a 40% improvement.

Do you have any concerns about the approach in this PR, or do you think there is an alternative way to implement it? Thanks!
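
For readers following along: the recomputation the comment describes is avoided today by persisting the RDD before checkpointing it, so that the separate checkpoint job reads cached blocks rather than re-running the lineage. A minimal sketch of that pattern (the app name and HDFS paths are illustrative, not from the PR):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder.appName("checkpoint-sketch").getOrCreate()
val sc = spark.sparkContext
sc.setCheckpointDir("hdfs:///tmp/checkpoints") // illustrative path

val rdd = sc.textFile("hdfs:///data/input") // illustrative input
  .map(_.toUpperCase)

// Without persist(), the checkpoint job launched after the first action
// recomputes the whole lineage from scratch -- the behavior discussed above.
rdd.persist(StorageLevel.MEMORY_AND_DISK)
rdd.checkpoint()
rdd.count() // first action materializes the cache and triggers the checkpoint job
```

As the comment notes, this workaround is not always practical: persisting interacts poorly with Dynamic Executor Allocation, which is part of the motivation for avoiding the extra job in the checkpointing path itself.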
