agrawaldevesh opened a new pull request #35005:
URL: https://github.com/apache/spark/pull/35005
Run checkpoint job only once when asked to do so eagerly.
### What changes were proposed in this pull request?
The current flow for an eager, reliable checkpoint is:
```
- df.checkpoint(eager = true, reliable = true)
  - rdd = get rdd from this df's physical plan
  - rdd.checkpoint (just marks checkpointData)
  - rdd.count (if eager = true)
    - SparkContext.runJob for all the partitions
      - DAGScheduler.runJob
      - rdd.doCheckpoint
        - ReliableCheckpointRDD#writeRDDToCheckpointDirectory
          - SparkContext.runJob for all the partitions
            - DAGScheduler.runJob (<-- This is the repeat job)
```
The local checkpointing case is less affected, because there the
checkpoint step only recomputes the missing partitions.
We tried a fix that replaces `rdd.count` above with `rdd.doCheckpoint`;
it appears to work and passes the unit tests.
So the new flow is simply:
```
- df.checkpoint(eager = true, reliable = true)
  - rdd = get rdd from this df's physical plan
  - rdd.checkpoint (just marks checkpointData)
  - rdd.doCheckpoint (if eager = true)
    - ReliableCheckpointRDD#writeRDDToCheckpointDirectory
      - SparkContext.runJob for all the partitions
        - DAGScheduler.runJob (<-- Only one job is run)
```
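The difference between the two flows can be illustrated with a toy model in plain Scala. This is only a sketch of the call structure described above, not Spark's implementation: `ToyScheduler`, `ToyRdd`, and `ToyCheckpointFlow` are hypothetical names, and the `doCheckpointCalled` guard is an assumption about how re-entry is prevented once a checkpoint has started.

```scala
// Toy model only: these classes imitate the call structure in the flows
// above. They are hypothetical and do not use any Spark classes.
final class ToyScheduler {
  var jobsRun: Int = 0
  // Stand-in for DAGScheduler.runJob: just counts submitted jobs.
  def runJob(): Unit = { jobsRun += 1 }
}

final class ToyRdd(scheduler: ToyScheduler) {
  private var marked = false
  private var doCheckpointCalled = false

  // Stand-in for rdd.checkpoint: only marks the RDD, runs nothing.
  def checkpoint(): Unit = { marked = true }

  // Stand-in for SparkContext.runJob: run the job, then try to checkpoint.
  private def runJob(): Unit = {
    scheduler.runJob()
    doCheckpoint()
  }

  // Stand-in for rdd.count: an action that goes through runJob.
  def count(): Unit = runJob()

  // Stand-in for rdd.doCheckpoint: writes the checkpoint with its own job,
  // guarded so that the nested runJob does not checkpoint again.
  def doCheckpoint(): Unit = {
    if (marked && !doCheckpointCalled) {
      doCheckpointCalled = true
      runJob() // models writeRDDToCheckpointDirectory
    }
  }
}

object ToyCheckpointFlow {
  // Old eager flow: mark, then count (job 1), which triggers the write (job 2).
  def jobsForOldFlow(): Int = {
    val scheduler = new ToyScheduler
    val rdd = new ToyRdd(scheduler)
    rdd.checkpoint()
    rdd.count()
    scheduler.jobsRun
  }

  // New eager flow: mark, then call doCheckpoint directly (a single job).
  def jobsForNewFlow(): Int = {
    val scheduler = new ToyScheduler
    val rdd = new ToyRdd(scheduler)
    rdd.checkpoint()
    rdd.doCheckpoint()
    scheduler.jobsRun
  }

  def main(args: Array[String]): Unit = {
    println(s"old flow jobs: ${jobsForOldFlow()}") // prints "old flow jobs: 2"
    println(s"new flow jobs: ${jobsForNewFlow()}") // prints "new flow jobs: 1"
  }
}
```

In the model, the extra job in the old flow comes entirely from `count()` running the computation once before the checkpoint write runs it again.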
### Why are the changes needed?
This simple fix substantially reduces the runtime of Spark jobs that make
heavy use of `DataFrame.checkpoint`.
### Does this PR introduce _any_ user-facing change?
Yes. Eager checkpointing should become faster because it runs half as many
Spark jobs.
### How was this patch tested?
Customer Spark apps that use checkpoint with this fix launch half as many
Spark jobs and see up to 50% less runtime in some cases.
Also added one more unit test to check that only one job is created.
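For anyone who wants to observe the job count outside the test suite, a rough sketch follows. It is not the unit test added in this PR. It assumes a local Spark build on the classpath, uses the public `SparkListener` API to count `onJobStart` events, and the object name, temp directory prefix, and sleep duration are arbitrary choices made for illustration.

```scala
import java.nio.file.Files
import java.util.concurrent.atomic.AtomicInteger

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}
import org.apache.spark.sql.SparkSession

// Hypothetical standalone check, not the test from this PR.
object EagerCheckpointJobCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("eager-checkpoint-job-count")
      .getOrCreate()
    val sc = spark.sparkContext

    // Reliable checkpointing needs a checkpoint directory.
    sc.setCheckpointDir(Files.createTempDirectory("ckpt").toString)

    // Count every job that starts after this point.
    val jobsStarted = new AtomicInteger(0)
    sc.addSparkListener(new SparkListener {
      override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
        jobsStarted.incrementAndGet()
      }
    })

    // Eager, reliable checkpoint through the public Dataset API.
    spark.range(0, 1000).toDF("id").checkpoint(eager = true)

    // Listener events are delivered asynchronously. The fixed sleep is a
    // crude assumption that the event queue drains within two seconds.
    Thread.sleep(2000)
    println(s"jobs started by checkpoint: ${jobsStarted.get()}")

    spark.stop()
  }
}
```

Based on the flows described above, the printed count would be expected to drop from two to one with this patch for a simple plan like this, though more complex plans may start additional jobs of their own.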
This patch may interact with Spark Streaming, since it touches the code
paths enabled by the config
`spark.checkpoint.checkpointAllMarkedAncestors`. I would be happy to add
more testing there if pointed in the right direction.