Github user ssaavedra commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22392#discussion_r218048682
  
    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala ---
    @@ -54,6 +54,10 @@ class Checkpoint(ssc: StreamingContext, val checkpointTime: Time)
           "spark.driver.bindAddress",
           "spark.driver.port",
           "spark.master",
    +      "spark.kubernetes.driver.pod.name",
    +      "spark.kubernetes.driver.limit.cores",
    +      "spark.kubernetes.executor.limit.cores",
    --- End diff --
    
    I'm not sure about the use case there. I think I agree in general with adding all of the request/limit knobs, because the cluster may have changed (e.g., a "core" is now a latest-gen Xeon instead of a prev-gen Celeron), or the job might have been starved by resource limits. However, I'm not sure the general job settings should need tweaking when reloading a job from a checkpoint.
    
    My rationale is that restoring from a checkpoint should only happen when the job would otherwise simply have kept running, with no opportunity for you to tweak such settings. I'd argue that if you need that kind of fine-tuning, you should perform a new deployment of your job instead of re-launching the previous one. You do need to refresh the values that come from non-deterministic choices made by the deployment process (such as pod names, IP addresses and so on), but I'd say the rest of the config flags should be unaffected.
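
    To make the intent concrete, here is a minimal sketch of how I picture the recovery step (illustrative only, not the actual `Checkpoint.scala` code; `CheckpointConfSketch` and `restoredConf` are made-up names, and the key list just mirrors the properties touched in this diff):

```scala
import org.apache.spark.SparkConf

// Illustrative sketch only (not the real Checkpoint.scala): rebuild a SparkConf
// from the checkpointed key/value pairs, then re-read the deployment-specific
// keys from the current environment so that pod names, bind addresses, etc.
// match the newly launched driver rather than the old one.
object CheckpointConfSketch {

  // Keys whose values are decided by the deployment process and therefore
  // must not be carried over verbatim from the previous run.
  val propertiesToReload: Seq[String] = Seq(
    "spark.driver.host",
    "spark.driver.bindAddress",
    "spark.driver.port",
    "spark.master",
    "spark.kubernetes.driver.pod.name",
    "spark.kubernetes.driver.limit.cores",
    "spark.kubernetes.executor.limit.cores")

  def restoredConf(checkpointedPairs: Seq[(String, String)]): SparkConf = {
    // Start from the values that were serialized into the checkpoint.
    val conf = new SparkConf(loadDefaults = false).setAll(checkpointedPairs)
    // For the reloadable keys, values present in the current environment win.
    val current = new SparkConf(loadDefaults = true)
    propertiesToReload.foreach { key =>
      current.getOption(key).foreach(value => conf.set(key, value))
    }
    conf
  }
}
```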
    
    In particular, none of the Python- and R-related configurations should need to change in this case, since changing them would mean you are actually changing the operations you are going to perform, or the Python version itself.
    
    I can either add the missing resource request/limit flags or remove them completely; I'm not sure which is the better approach. In any case, I think further discussion on spark-dev should settle this, but if you want checkpointing to be ready for 2.4.0 (we are already late), I'd go with what's needed for it to work and follow up later with more advanced use cases. I'm open to alternatives, though. Also, for flags that are not really cluster-configuration-related, being able to change them falls out of scope here; that should instead lead to a discussion about whether restoring a job from a checkpoint should allow the job to carry different run-time semantics.
    
    Looking forward to a solution :)

