wolvery commented on PR #59413:
URL: https://github.com/apache/airflow/pull/59413#issuecomment-3662769256

   > @wolvery
   > 
   > Thanks for opening a PR to address this issue. Had I known you were 
already working on a solution, I would have deferred to you, as you naturally 
understand your use case better than I do. That said, I am concerned that 
changing the behaviour of an existing cleanup mode could introduce a risk of 
breaking current deployments.
   > 
   > My understanding of the intended semantics of the cleanup modes is:
   > 
   > 1. `delete_pod`: The most aggressive option, and the default. This is best 
suited for “happy path” scenarios where there is no expectation of needing pods 
for post-mortem debugging.
   > 2. `delete_succeeded_pod`: Cleans up pods that have completed successfully 
and whose logs are no longer required. These pods have run to completion 
without issues and no longer need to remain in the namespace.
   > 3. `keep_pod`: Retains all pod objects. While I would not personally 
recommend this in most cases, it provides users with maximum flexibility.
   > 4. `delete_active_pod`: A new mode I introduced based on my interpretation 
of your issue. This removes pods that are consuming (or may consume) 
resources—i.e. pods in RUNNING or PENDING states—while retaining completed pods 
for debugging. This avoids scenarios where resource-heavy pods continue running 
after the task has terminated, without losing succeeded/failed pods that may be 
useful for investigation. If my understanding is correct, this should cover 
your use case: retaining completed pods for debugging while cleaning up 
resource-hogging ones (a short usage sketch follows this list).
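   >
   > For concreteness, a minimal usage sketch of the `on_finish_action` parameter 
(the task id, pod name, and image below are illustrative; `delete_active_pod` is 
the value this PR proposes and is not part of a released provider):
   >
   > ```python
   > from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
   >
   > # Existing values for on_finish_action: "delete_pod" (default), "keep_pod",
   > # and "delete_succeeded_pod". "delete_active_pod" is the value proposed in
   > # this PR, not a released option.
   > long_running = KubernetesPodOperator(
   >     task_id="long_running",          # illustrative task id
   >     name="long-running",
   >     image="busybox",
   >     cmds=["sh", "-c", "sleep 3600"],
   >     on_finish_action="delete_active_pod",  # delete Pending/Running pods when the task ends
   > )
   > ```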
   > 
   > We could alternatively change the behaviour of `delete_succeeded_pod`, but 
that would require redefining what the flag represents. Doing so risks breaking 
existing deployments and effectively deprecates the current semantics. Keeping 
the flag name while changing its behaviour would also be confusing: a mode 
called `delete_succeeded_pod` would then delete running and pending pods as 
well. While this may seem pedantic, many users will not read docstrings closely 
and could inadvertently choose a setting that deletes pods they intended to 
keep.
   > 
   > If you insist on a mode that deletes all pods except those that have 
failed, I believe it would be better to add a new flag for this with an 
appropriate name such as `delete_non_failed_pod`. But would this new flag 
provide additional useful functionality that `delete_active_pod` does not? I 
understand you might want succeeded pods to be removed along with the 
running/pending pods, but I am not sure what purpose that would serve. In 
post-mortem debugging, I believe it would be more useful to have logs for both 
successful and failed runs rather than just the latter.
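   >
   > To make the comparison concrete, a hypothetical helper (not the provider's 
actual implementation; `delete_active_pod` and `delete_non_failed_pod` are only 
the names discussed here) mapping each mode to the pod phases it would remove:
   >
   > ```python
   > # Hypothetical sketch of which pod phases each cleanup mode would remove;
   > # "delete_active_pod" and "delete_non_failed_pod" are names from this
   > # discussion, not released provider options.
   > def should_delete(mode: str, phase: str) -> bool:
   >     # Kubernetes pod phases: "Pending", "Running", "Succeeded", "Failed".
   >     if mode == "delete_pod":
   >         return True
   >     if mode == "keep_pod":
   >         return False
   >     if mode == "delete_succeeded_pod":
   >         return phase == "Succeeded"
   >     if mode == "delete_active_pod":  # free resources, keep completed pods
   >         return phase in ("Pending", "Running")
   >     if mode == "delete_non_failed_pod":  # keep only failed pods
   >         return phase != "Failed"
   >     raise ValueError(f"unknown cleanup mode: {mode}")
   > ```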
   > 
   > I think it would be best to have someone who owns this provider look at 
this and offer their comments. Even though this is a seemingly trivial change, 
it has subtle risks that require consideration. That being said, I am happy that 
you raised the issue. The absence of a specific flag to delete running/pending 
pods on task termination was definitely a gap that needed to be addressed.
   > 
   > @hussein-awala @jedcunningham 
   
   I was clear about the issue I raised: we had pods running for long periods 
without Airflow Workers monitoring them. We expected that the Airflow Worker 
would not leave these kinds of pods behind at all. This led to increased costs, 
and in our situation we also had resource constraints in the cluster. As a 
result, we hit the cluster limits, which affected the entire Airflow 
environment.
   
   We need to remove successful pods because we have components that read pods 
at startup, and leaving those pods in place is a critical problem for us. I 
don’t see value in adding another option, since we already have one for keeping 
failed pods.
   
   I don’t understand why we should keep this behavior and let others face the 
same issue. I opened this issue based on an incident we experienced, and other 
people might face the same problem if they rely on the KubernetesPodOperator. 
We are here to contribute, so even if you haven’t experienced the same 
incident, let’s keep a helpful mindset so that others don’t run into the same 
issues.
   
   Sometimes we are not able to review PRs promptly, and I apologize for that. 
However, I opened this issue and this PR because I believe this is the correct 
approach. All the details are in the PR comment, so please take your time and 
read it, and read the issue again as well. You have already made your comments 
on the PR, but I disagree with the solution you proposed. I don't see a reason 
for a long discussion on this topic: the issue lies with this option, and we 
should build a strong tool for others.

