agrawaldevesh commented on pull request #35005:
URL: https://github.com/apache/spark/pull/35005#issuecomment-1004372726


   Hi @mridulm, thanks for pointing out that checkpointing does not necessarily 
rerun the whole job. In practice, however, we have noticed that it often does end 
up re-running the job, for the following reasons:
   - Many user jobs don't have a shuffle, for example simple ETL or ML training 
jobs.
   - Users sometimes don't cache/persist the RDD prior to calling checkpoint. 
They do this sometimes out of oversight, and sometimes because they notice that 
Dynamic Executor Allocation does not play well with .cache/.persist. In this 
case, as you note, the whole non-shuffle lineage would be rerun.
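   To illustrate the second bullet, here is a minimal sketch of the cache-before-checkpoint pattern whose omission triggers the recompute (the paths, app name, and transformation are illustrative, not from the PR):

   ```scala
   import org.apache.spark.sql.SparkSession

   // Sketch: caching an RDD before checkpointing lets the checkpoint write
   // reuse the in-memory partitions instead of recomputing the lineage.
   val spark = SparkSession.builder().appName("checkpoint-sketch").getOrCreate()
   val sc = spark.sparkContext
   sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // illustrative path

   val rdd = sc.textFile("hdfs:///data/input")      // illustrative path
     .map(_.toUpperCase)

   rdd.cache()        // without this, checkpoint() recomputes the whole lineage
   rdd.checkpoint()   // materialized lazily, on the next action
   rdd.count()        // first action: computes, caches, and writes the checkpoint
   ```

   When the `rdd.cache()` line is skipped, the action that materializes the checkpoint recomputes every upstream transformation a second time, which is the re-run cost this PR is trying to avoid.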
   
   We have noticed that while caching/persisting can be faster than 
checkpointing to HDFS, it is somewhat less robust in our environment, so we see 
users opting for checkpointing anyway.
   
   We tried this PR in a couple of internal, checkpoint-heavy ML training jobs 
and saw roughly a 40% runtime improvement.
   
   Do you have any concerns about the approach in this PR -- or perhaps you 
think there is an alternative way to implement this?
   
   Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


