JarroVGIT commented on issue #1829:
URL:
https://github.com/apache/datafusion-ballista/issues/1829#issuecomment-4727961494
That really depends on what you mean by `checkpoint()`, of course ;) In
Spark, I believe `df.checkpoint()` writes the RDD to disk while `df.cache()`
keeps the results in memory (potentially spilling to disk). Another
difference is that a cached result can be recomputed because `cache` retains
the logical plan, whereas `checkpoint` rewrites the logical plan to point to
the persisted result.
I actually hadn't thought about it from this point of view; to be honest, I
essentially mixed both into one issue. Part of the reason I lean towards
re-using the shuffle mechanics is #1539: it would fit naturally with that and
would help survive executor interruptions. I now realize that my approach above
basically **is** describing an equivalent of `checkpoint` rather than `cache`.
I am working on this a bit today and tomorrow. What are your thoughts on
this? Is the summary below correct?
- `cache()`:
  - Lazy computation
  - Client retains the logical plan, just adds a `CacheNode` with a
    `cache_id`
  - Keeps results in the executor's memory (potentially spilling to disk)
- `checkpoint()`:
  - Not lazy
  - Returns a new df with a new logical plan that points to the checkpointed
    results (like being a table, essentially?)
  - Writes results as shuffle files; the scheduler registers them as a new
    `CheckPointTable`
Is this a more correct mental model?
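
To make the distinction concrete, here is a rough toy sketch of the two behaviours in plain Python. It is not Ballista or Spark code: `Scan`, `Double`, `ToyCluster` and the `cache`/`checkpoint` helpers are hypothetical names made up for illustration, and the `checkpoints` dict only stands in for shuffle files registered with the scheduler.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Scan:
    rows: tuple  # stand-in for a real data source


@dataclass(frozen=True)
class Double:
    input: "Plan"  # toy transformation: multiply every row by 2


@dataclass(frozen=True)
class CacheNode:
    cache_id: str
    input: "Plan"  # the original plan is retained, so it can be recomputed


@dataclass(frozen=True)
class CheckPointTable:
    checkpoint_id: str  # no input: the lineage is cut


Plan = Union[Scan, Double, CacheNode, CheckPointTable]


class ToyCluster:
    """Hypothetical stand-in for scheduler + executors."""

    def __init__(self):
        self.memory = {}       # cache_id -> rows held in executor memory
        self.checkpoints = {}  # checkpoint_id -> rows (pretend shuffle files)
        self.scans = 0         # how often the source was actually read

    def execute(self, plan):
        if isinstance(plan, Scan):
            self.scans += 1
            return list(plan.rows)
        if isinstance(plan, Double):
            return [2 * x for x in self.execute(plan.input)]
        if isinstance(plan, CacheNode):
            # Populate on first use; recompute from the plan if it was lost.
            if plan.cache_id not in self.memory:
                self.memory[plan.cache_id] = self.execute(plan.input)
            return list(self.memory[plan.cache_id])
        if isinstance(plan, CheckPointTable):
            return list(self.checkpoints[plan.checkpoint_id])
        raise TypeError(f"unknown plan node: {plan!r}")


def cache(plan, cache_id):
    # Lazy: only wraps the plan, nothing runs yet.
    return CacheNode(cache_id, plan)


def checkpoint(cluster, plan, checkpoint_id):
    # Eager: runs the plan now and returns a plan that points at the result.
    cluster.checkpoints[checkpoint_id] = cluster.execute(plan)
    return CheckPointTable(checkpoint_id)
```

In this toy model, clearing `memory` (an executor going away) just means the next `execute` of a `CacheNode` re-runs its input, while a `CheckPointTable` never touches the original plan again.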