SameerMesiah97 commented on PR #60717: URL: https://github.com/apache/airflow/pull/60717#issuecomment-3765325623
> As if an airflow scheduler interruption? As I don't see how this may happen, worker interruption seems more possible, however I don't see when it can happen, let's say for some reason it did exit, the pod is killed yet the driver stays up until it is finished, the task is marked as failed, if we have retries, won't it just create a new spark application and update the try number label? Or if we are in a different dagrun, the runid will be different. I admit this is a rare edge case.

The scenario I am referring to is one where the scheduler gets OOM-killed and this prevents the task state from being updated. On scheduler restart, the task will resume with the same try number and look for an existing driver pod whose labels contain the exact same context variables as before the crash.

> The Spark application as if the CRD? Doesn't it monitor the driver? Or basically deploy and manage the driver? I think I might be missing something, or misunderstanding something.

Yes. The SparkApplication is the owner of the driver pod. The `SparkKubernetesOperator` merely submits the SparkApplication, which then handles the lifecycle of the driver pod. I believe the problem occurs in a very narrow window between the submission of the SparkApplication and the execution of `find_spark_job`, during which the aforementioned scheduler crash could happen. Consider the following timeline:

1. Worker submits the SparkApplication.
2. The SparkApplication creates a driver pod.
3. The scheduler crashes (e.g. OOM).
4. The SparkApplication later creates a second driver pod (with the same labels).
5. The scheduler restarts.
6. The worker runs `find_spark_job`.
7. `find_spark_job` sees two matching pods.

> What if I do want to fail the task? As if I am truncating a table which may not be an atomic operation, and I use airflow expecting there will be no more than 1 run, and so I want to be notified on failure.

I believe this point is somewhat tangential to deterministic pod selection. This is not about multiple task runs but about failsafe pod recovery in a very specific scenario. And if you do want to fail the task under certain conditions, using pod discovery as a trigger for task success or failure is brittle, since there can be multiple pods per task. It would be better to use something more explicit, such as an exit code or the return value of the operation. In your table-truncation example, that would be the failure/success event fired by the DB.

Also, under this new implementation, users will still be notified when there are multiple pods; there just won't be a hard failure like before. I do admit that I may have downplayed this in the PR description. It is a significant behavior change, as multiple pods now lead to soft recovery instead of hard failure. But that is because the situation does not indicate anything critically wrong within Airflow; it stems from the idiosyncrasies of how a SparkApplication handles pod reconciliation and creation.

> Is there a way to reproduce the issue stated above?

It's possible, but very difficult to do deterministically. If you look at the existing condition, it appears to imply that multiple pods with the same label being picked up is a plausible scenario. This is what initially motivated me to open this PR.
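To make the lookup concrete, here is a minimal sketch of the kind of label-based pod listing involved (not the operator's actual code; the namespace and label keys below are made up for illustration):

```python
from kubernetes import client, config

# Sketch of a label-based driver-pod lookup. The label keys (dag_id, task_id,
# run_id, try_number) and the namespace are illustrative assumptions, not the
# exact labels used by SparkKubernetesOperator.
config.load_kube_config()
v1 = client.CoreV1Api()

label_selector = ",".join(
    [
        "spark-role=driver",
        "dag_id=my_dag",
        "task_id=my_task",
        "run_id=manual__2024-01-01",
        "try_number=1",
    ]
)

pods = v1.list_namespaced_pod("spark-jobs", label_selector=label_selector).items

# In the crash/restart timeline above, this list can contain two driver pods
# that both match the same task context, so a hard failure on len(pods) > 1
# does not necessarily indicate anything wrong inside Airflow.
print([p.metadata.name for p in pods])
```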
> And isn't there a case where the creation timestamp is the wrong pod? I.e the older retry pod was stuck in pending, so the older one got a newer created at timestamp, can this cause any issues?

I agree. I think the approach needs to be modified so that it prioritizes running pods, with `creation_timestamp` serving only as a tie-breaker. This is a very valid edge case that I overlooked.
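Concretely, the revised selection I have in mind would look roughly like this (a sketch only; the helper name and the newest-first tie-break direction are assumptions, not the final implementation):

```python
from typing import Optional

from kubernetes.client.models import V1Pod


def select_driver_pod(candidates: list[V1Pod]) -> Optional[V1Pod]:
    """Pick a single driver pod deterministically.

    Running pods are preferred over non-running ones; creation_timestamp is
    used purely as a tie-breaker (newest first is an assumption here).
    """
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda pod: (
            pod.status.phase == "Running",       # running pods win outright
            pod.metadata.creation_timestamp,     # tie-break by creation time
        ),
    )
```

Preferring the `Running` phase handles the stuck-in-pending retry pod, and `creation_timestamp` only matters when both candidates are in the same phase.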
