GitHub user szhem commented on the issue:
https://github.com/apache/spark/pull/19410
@mallman
Just my two cents regarding built-in solutions:
The periodic checkpointer deletes old checkpoint files so that they do not
fill up the disk. Disk storage is cheap, but it's not free.
For example, in my case (a graph with >1B vertices and about the same number
of edges), a checkpoint directory holding a single checkpoint took about
150-200GB. The checkpoint interval was set to 5, and the job completed in
about 100 iterations.
So without cleaning up unnecessary checkpoints, the checkpoint directory
could grow up to 6TB in my case, which is quite a lot.
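
A rough back-of-the-envelope sketch of that growth, assuming every checkpoint is the same size and none is ever deleted. The function names are hypothetical and only for illustration. Under these simple assumptions the numbers above give roughly 3-4TB, so the 6TB figure should be read as a looser upper bound that may include overhead not modeled here.

```python
def checkpoint_count(iterations: int, interval: int) -> int:
    """Number of checkpoints written when one is taken every `interval` iterations."""
    return iterations // interval


def retained_size_gb(iterations: int, interval: int, size_per_checkpoint_gb: int) -> int:
    """Total size of the checkpoint directory if no checkpoint is ever removed.

    Assumption: every checkpoint has the same size.
    """
    return checkpoint_count(iterations, interval) * size_per_checkpoint_gb


# Figures from the comment: ~100 iterations, interval 5, 150-200GB per checkpoint.
low_gb = retained_size_gb(100, 5, 150)
high_gb = retained_size_gb(100, 5, 200)
print(f"{checkpoint_count(100, 5)} checkpoints, {low_gb}-{high_gb} GB without cleanup")
```

With cleanup enabled, only the most recent checkpoint(s) stay on disk, so usage stays near the size of a single checkpoint instead of growing with the iteration count.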