JarroVGIT commented on issue #1829:
URL: 
https://github.com/apache/datafusion-ballista/issues/1829#issuecomment-4727961494

   That really depends on what you mean by `checkpoint()` of course ;) In 
Spark, I believe `df.checkpoint()` writes the RDD to disk while `df.cache()` 
keeps the results in memory (potentially spilling to disk). Another 
difference is that a `cache` result can be recomputed because it retains the 
logical plan, whereas `checkpoint` rewrites the logical plan to point to the 
persisted result. 
   
   I actually hadn't thought of it from this point of view; I essentially 
mixed both into one issue, to be honest. Part of the reason I lean towards 
re-using the shuffle mechanics was #1539: it would fit naturally with that and 
would help survive executor interruptions. Now I realize that my approach above 
basically **is** describing an equivalent to `checkpoint` rather than `cache`. 
   
   I am working on this a bit today and tomorrow. What are your thoughts on 
this? Is the below a correct summary?
   
   - `cache()`:
     - Lazy computation
     - Client retains the logical plan, just adds a `CacheNode` with a 
`cache_id`
     - Keeps results in executor memory (potentially spilling to disk)
   - `checkpoint()`:
     - Not lazy
     - Returns a new df with a new logical plan that points to the 
checkpointed results (like being a table, essentially?)
     - Writes results as shuffle files, scheduler registers it as a new 
`CheckPointTable`
   
   Is this a more correct mental model?
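
To make the distinction concrete, here is a toy Python sketch of the two semantics as summarized above. It is only a model of the mental picture, not Ballista or Spark code: `Plan`, `Session`, `cache_store`, `checkpoint_store` and the operator names are all hypothetical, an in-memory dict stands in for executor memory, and another dict stands in for persisted shuffle files.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Plan:
    """Toy logical plan node (hypothetical, not a Ballista/Spark type)."""
    op: str                          # "source", "double", "cache", "checkpoint_table"
    arg: str = ""                    # cache_id or checkpoint table name
    child: Optional["Plan"] = None   # upstream plan, if any

    def depth(self) -> int:
        """Number of nodes from this node down to the leaf."""
        return 1 if self.child is None else 1 + self.child.depth()


class Session:
    """Toy stand-in for client + scheduler + executors."""

    def __init__(self) -> None:
        self.cache_store: Dict[str, List[int]] = {}       # "executor memory"
        self.checkpoint_store: Dict[str, List[int]] = {}  # "shuffle files"
        self.source_scans = 0                              # counts real recomputation

    def execute(self, plan: Plan) -> List[int]:
        if plan.op == "source":
            self.source_scans += 1
            return [1, 2, 3]
        if plan.op == "double":
            return [x * 2 for x in self.execute(plan.child)]
        if plan.op == "cache":
            # Lazy: populated on first execution; recomputed from the
            # retained child plan if the cached data is gone.
            if plan.arg not in self.cache_store:
                self.cache_store[plan.arg] = self.execute(plan.child)
            return self.cache_store[plan.arg]
        if plan.op == "checkpoint_table":
            # No child plan: the persisted result is the only source.
            return self.checkpoint_store[plan.arg]
        raise ValueError(f"unknown operator: {plan.op}")

    def cache(self, plan: Plan, cache_id: str) -> Plan:
        """Lazy: only wraps the existing plan in a cache node."""
        return Plan("cache", arg=cache_id, child=plan)

    def checkpoint(self, plan: Plan, name: str) -> Plan:
        """Eager: runs the plan now and returns a plan that scans the result."""
        self.checkpoint_store[name] = self.execute(plan)
        return Plan("checkpoint_table", arg=name)


session = Session()
base = Plan("double", child=Plan("source"))

cached = session.cache(base, "c1")   # nothing has run yet
session.execute(cached)              # first use computes and stores
session.cache_store.clear()          # simulate losing the executor
session.execute(cached)              # still works: recomputed from the plan

checkpointed = session.checkpoint(base, "t1")  # runs immediately
print(cached.depth(), checkpointed.depth())
```

In this model the cached plan still carries its full lineage, which is what makes recomputation possible after the cached data is lost, while the checkpointed plan is a single leaf node that can only be served from the persisted result.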
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

