wolvery commented on PR #59413: URL: https://github.com/apache/airflow/pull/59413#issuecomment-3662769256
> @wolvery
>
> Thanks for opening a PR to address this issue. Had I known you were already working on a solution, I would have deferred to you, as you naturally understand your use case better than I do. That said, I am concerned that changing the behaviour of an existing cleanup mode could introduce a risk of breaking current deployments.
>
> My understanding of the intended semantics of the cleanup modes is:
>
> 1. `delete_pod`: The most aggressive option, and the default. This is best suited for "happy path" scenarios where there is no expectation of needing pods for post-mortem debugging.
> 2. `delete_succeeded_pod`: Cleans up pods that have completed successfully and whose logs are no longer required. These pods have run to completion without issues and no longer need to remain in the namespace.
> 3. `keep_pod`: Retains all pod objects. While I would not personally recommend this in most cases, it provides users with maximum flexibility.
> 4. `delete_active_pod`: A new mode I introduced based on my interpretation of your issue. This removes pods that are consuming (or may consume) resources, i.e. pods in RUNNING or PENDING states, while retaining completed pods for debugging. This avoids scenarios where resource-heavy pods continue running after the task has terminated, without losing succeeded/failed pods that may be useful for investigation. If my understanding is correct, this should cover your use case: retaining completed pods for debugging while cleaning up resource-hogging ones.
>
> We could alternatively change the behaviour of `delete_succeeded_pod`, but that would require redefining what the flag represents. Doing so risks breaking existing deployments and effectively deprecates the current semantics. Keeping the flag name while changing its behaviour would also be confusing: a mode called `delete_succeeded_pod` would then delete running and pending pods as well. While this may seem pedantic, many users will not read docstrings closely and could inadvertently choose a setting that deletes pods they intended to keep.
>
> If you insist on a mode that deletes all pods except the ones that have failed, I believe it would be better to add a new flag for this with an appropriate name such as `delete_non_failed_pod`. But would this new flag provide useful functionality that `delete_active_pod` does not? I understand you might want succeeded pods to be gone along with the running/pending pods, but I am not sure what purpose that would serve. In post-mortem debugging, I believe it would be more useful to have logs for both successful and failing states rather than just the latter.
>
> I think it would be best to have someone who owns this provider look at this and offer their comments. Even though this is a seemingly trivial change, it has subtle risks that require consideration. That being said, I am happy that you raised the issue. The absence of a specific flag to delete running/pending pods on task termination was definitely a gap that needed to be addressed.

@hussein-awala @jedcunningham I was clear about the issue I raised: we had pods running for long periods without Airflow workers monitoring them. In my opinion, we expected that the Airflow worker would not keep these kinds of pods at all. This led to increased costs, and in our situation we also had resource constraints in the cluster. As a result, we reached the cluster limits, which affected the entire Airflow environment.
We need to remove successful pods because we have components that read pods at startup, and leaving them in place is a real problem for us. I don't see value in adding another option, since we already have one for keeping failed pods, and I don't understand why we should keep the current behavior and let others run into the same issue.

I opened this issue based on an incident we experienced, and other people could hit the same problem if they rely on the KubernetesPodOperator. We are here to contribute, so even if you haven't experienced the same incident, let's try to keep a helpful mindset so that others don't run into the same issues. Sometimes we are not able to review PRs properly, and I apologize for that. However, I opened this issue and this PR because I believe this is the correct approach. All the details are in the PR description, so please take your time and read it, and read the issue again.

You have already made your comments on the PR, but I disagree with the solution you proposed. I don't see a reason for a long discussion on this topic: the issue lies with this option, and we should build a strong tool for others.
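For context, here is a minimal sketch of how these cleanup modes are selected today, assuming the current `cncf.kubernetes` provider API. Note that `delete_active_pod` is only what this PR proposes and does not exist in any released provider; everything else below uses existing parameters.

```python
# Minimal sketch (assumes a recent apache-airflow-providers-cncf-kubernetes
# version where the operator lives under operators.pod and accepts
# on_finish_action instead of the deprecated is_delete_operator_pod).
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

task = KubernetesPodOperator(
    task_id="example_task",
    name="example-pod",
    namespace="default",
    image="alpine:3.19",
    cmds=["sh", "-c", "echo hello"],
    # Existing values: "keep_pod", "delete_pod", "delete_succeeded_pod".
    # With "delete_succeeded_pod", only pods that completed successfully
    # are removed; failed pods stay in the namespace for post-mortem
    # debugging. The "delete_active_pod" mode discussed above is only a
    # proposal in this PR.
    on_finish_action="delete_succeeded_pod",
)
```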
