Nataneljpwd commented on PR #61110:
URL: https://github.com/apache/airflow/pull/61110#issuecomment-3833405080

   Hello, thank you for the comment!
   
   > @Nataneljpwd
   > 
   > I still have two blocking concerns here:
   > 
   > 1. I don’t think the quorum-read argument holds. Not setting 
`resourceVersion` may require a quorum read, but that only guarantees 
freshness of what the API server returns. It doesn’t mean we’re reading 
directly from etcd, and more importantly it doesn’t enforce any stronger 
invariants (like uniqueness or lifecycle ordering).
   
   This is a valid concern. That said, per the Kubernetes API documentation, not setting a `resourceVersion` [_requires_](https://kubernetes.io/docs/reference/using-api/api-concepts/#:~:text=Unless%20you%20have,to%20be%20served.) a quorum read in order to be served, which means the API server validates against etcd before returning a result, so it appears to achieve the same effect.
   Regarding the invariants, I understand the concern; I have explained in a code comment why we do not care about the ordering and why we will not get more than one result.
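   
   To illustrate the point, here is a minimal sketch using the official `kubernetes` Python client (the namespace and label selector are placeholders, not the provider's actual values): leaving `resource_version` unset asks for the freshest, quorum-backed view, while passing `resource_version="0"` lets the API server answer from its watch cache.
   
   ```
   from kubernetes import client, config

   config.load_kube_config()  # or config.load_incluster_config() inside a cluster
   v1 = client.CoreV1Api()

   # Hypothetical namespace and labels, for illustration only.
   namespace = "spark-jobs"
   label_selector = "spark-role=driver,sparkoperator.k8s.io/app-name=my-app"

   # No resource_version: the request must be served from a quorum read,
   # i.e. the API server validates with etcd before answering.
   fresh = v1.list_namespaced_pod(namespace, label_selector=label_selector)

   # resource_version="0": the API server may answer from its watch cache,
   # so the result can be arbitrarily stale.
   cached = v1.list_namespaced_pod(
       namespace, label_selector=label_selector, resource_version="0"
   )
   ```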
   
   > 2. Even if the API response reflects the latest etcd state, Kubernetes 
does not guarantee that there can’t be multiple pods with the same labels in 
`Pending` or `Running` at the same time. I understand the SparkApplication 
controller intends there to be a single driver, but that guarantee is eventual. 
During restarts or reconciliation, overlapping driver pods are possible and 
expected. Your filter for deletion_timestamp partially mitigates against this 
by excluding terminating pods, but I suspect there can still be windows where 
the replacement pod is created before the deletion for the existing pod is 
triggered.
   
   Yes, though the Spark Kubernetes operator, which manages the Spark CRD, makes 
sure there is only one driver pod up at any given time. The overlapping 
transition does not happen, because the old driver pod is deleted before a new 
one is created, as documented 
[here](https://www.kubeflow.org/docs/components/spark-operator/overview/#:~:text=When%20the%20operator%20decides%20to,create%20a%20new%20driver%20pod.).
   
   > Because of that, removing the existing phase-based prioritization is 
risky. I see that you are now prioritizing 'Succeeded' pods, but since you have 
removed the ordering for `Running` and `Pending` pods, this does not mitigate 
the concerns I explained above.
   
   Don't we want to prioritize `Succeeded` pods? There is no case in which a 
driver for an application will be running while another pod for it is in the 
`Succeeded` state. I will add back the creation time check, as it is something 
I most likely deleted by accident.
   
   > I would advise you to implement something like this:
   > 
   > ```
   > pod = max(
   >     pod_list,
   >     key=lambda p: (
   >         p.metadata.deletion_timestamp is None,
   >         p.status.phase == PodPhase.RUNNING,
   >         p.status.phase == PodPhase.SUCCEEDED,
   >         p.metadata.creation_timestamp or datetime.min.replace(tzinfo=timezone.utc),
   >         p.metadata.name or "",
   >     ),
   > )
   > ```
   > 
   > This handles the edge cases revealed by this PR without accidentally 
returning multiple pods. I would advise removing the field selector as well 
because we are already filtering in the lambda expression.
   
   We can keep the field selector, as we do not need to explicitly prefer 
`Running` pods: a `Running` pod can only occur here if it is terminating, 
meaning its deletion timestamp is not None. We also do not want the `RUNNING` 
preference, since it could make us select such a terminating pod ahead of a 
newly created one.
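   
   To make this concrete, here is a rough, self-contained sketch of the selection I am describing; the namespace, label selector, and field selector strings are illustrative placeholders rather than the exact values used in the PR. Terminating pods are filtered out, a `Succeeded` driver wins over anything else, and ties are broken by the most recent creation time.
   
   ```
   from datetime import datetime, timezone

   from kubernetes import client, config

   config.load_kube_config()  # or config.load_incluster_config() inside a cluster
   v1 = client.CoreV1Api()

   # Illustrative values, not the exact ones from the provider code.
   namespace = "spark-jobs"
   label_selector = "spark-role=driver,sparkoperator.k8s.io/app-name=my-app"
   field_selector = "status.phase!=Failed,status.phase!=Unknown"

   pod_list = v1.list_namespaced_pod(
       namespace, label_selector=label_selector, field_selector=field_selector
   ).items

   # Drop terminating pods, prefer a Succeeded driver, then the newest pod.
   candidates = [p for p in pod_list if p.metadata.deletion_timestamp is None]
   pod = max(
       candidates,
       key=lambda p: (
           p.status.phase == "Succeeded",
           p.metadata.creation_timestamp or datetime.min.replace(tzinfo=timezone.utc),
           p.metadata.name or "",
       ),
       default=None,
   )
   ```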
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
