Github user ssaavedra commented on a diff in the pull request:
https://github.com/apache/spark/pull/22392#discussion_r218048682
--- Diff: streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala ---
@@ -54,6 +54,10 @@ class Checkpoint(ssc: StreamingContext, val checkpointTime: Time)
"spark.driver.bindAddress",
"spark.driver.port",
"spark.master",
+ "spark.kubernetes.driver.pod.name",
+ "spark.kubernetes.driver.limit.cores",
+ "spark.kubernetes.executor.limit.cores",
--- End diff ---
I'm not sure about the use case there. I think I agree in general with
adding all of the request/limit knobs, because the cluster may have changed
(e.g., a "core" is now a latest-gen Xeon instead of a prev-gen Celeron), or the
job might have been starved by resource limits. However, I'm not sure the
general job settings should need tweaking when reloading a job from a
checkpoint.
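For instance (purely illustrative values; only the `limit.cores` keys below
come from this diff, `spark.executor.memory` is just a regular Spark setting),
these are the kind of knobs one might legitimately want to retune against new
hardware:

```scala
import org.apache.spark.SparkConf

// Hypothetical retuning for a changed cluster; the values are made up and the
// limit.cores keys are the ones added in this diff.
val retunedConf = new SparkConf()
  .set("spark.kubernetes.driver.limit.cores", "2")
  .set("spark.kubernetes.executor.limit.cores", "4")
  .set("spark.executor.memory", "8g")
```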
My rationale is that checkpoint recovery should only kick in when the job
would otherwise just have kept running, with no opportunity for you to tweak
such settings. I'd argue that if you need that kind of fine-tuning, you should
perform a new deployment of your job instead of re-launching the previous one.
You do need to fix the variables that refer to the non-deterministic choices
made by the deployment process (such as Pod names, IP addresses and so on),
but I'd say the rest of the config flags should be unaffected.
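To make that concrete, here is a rough sketch of the reload semantics I have
in mind (the method and parameter names are illustrative, not the actual code
in `Checkpoint.scala`): keys listed in `propertiesToReload` are re-read from
the fresh submission and override the checkpointed values, while every other
flag keeps the value it had at checkpoint time.

```scala
import org.apache.spark.SparkConf

// Illustrative sketch only. Deployment-specific keys (pod name, bind address,
// port, ...) are re-read from the new submission because they are chosen
// non-deterministically at deploy time; everything else stays as checkpointed.
def reloadDeploymentSpecificProps(
    restoredConf: SparkConf,              // conf deserialized from the checkpoint
    propertiesToReload: Seq[String]): SparkConf = {
  val freshConf = new SparkConf(loadDefaults = true)  // the new submission's settings
  propertiesToReload.foreach { key =>
    freshConf.getOption(key).foreach(value => restoredConf.set(key, value))
  }
  restoredConf
}
```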
In particular, regarding the Python- and R-related configurations, none of
those should need to change in this case, since changing them would mean you
are actually changing the operations you are going to perform, or the Python
version itself.
I can add the missing resource request/limit flags or remove them
completely; I'm not sure which is the better approach. In any case, I think a
further discussion on spark-dev should settle this, but if you want
checkpointing to be ready for 2.4.0 (we are already late), I'd go with what's
needed for it to work and follow up later with more advanced use cases. I'm
open to alternatives, though. Also, for the flags that are not really
cluster-configuration related, being able to change them falls out of scope
here; that should instead lead to a discussion about whether restoring a job
from a checkpoint should allow the job to carry different run-time semantics.
Looking forward to a solution :)