jscheffl opened a new pull request, #63489: URL: https://github.com/apache/airflow/pull/63489
As we have many Dags with different priorities and use a lot of Deferred Tasks, we noticed that tasks returning from the triggerer to a worker are sometimes stuck for a long time. This can happen if a Dag with higher priority is started after a task has been deferred: on return, other tasks are scheduled ahead of the returning task.

This is bad, for example, in our case where we use KPO (KubernetesPodOperator): the Pods are completed but still allocate disk space, and they might get lost if the Autoscaler decides to turn down a node. The XCom information is then lost, because the XCom side-car is still active, waiting for the task to become active again on the worker so it can fetch the XCom and delete the Pod.

We considered the following options:

- Accept the situation... no, this has been bothering us for a long time: we lose a lot of XCom results and users sometimes wait a long time for results. We also had situations where newer high-priority workload could not be scheduled because nodes were still occupied by Pod leftovers waiting on deletion, and the deletion was not happening because of its lower priority.
- Implement a "hack" and increase the priority of tasks returning from the Triggerer. But then, if the task actually failed, the priority would need to be reset again.
- Ensure that tasks that return from the Triggerer are not scheduled again (with the risk of being delayed) but are pushed directly to the executor queue --> this PR!

I'd like to propose adding this as a new feature in 3.2.0. We have patched this locally into 3.1.7, and results show that (1) latency for cleanup in Deferred KPO is very much improved and (2) scheduling effort is reduced.

The trade-off is that there might be situations where more tasks than allowed are "running" if deferred tasks are not counted into pools. Therefore the config is not enabled by default and is opt-in.

FYI @AutomationDev85 @dabla

---

##### Was generative AI tooling used to co-author this PR?
- [ ] Yes (please specify the tool below) Claude 4.6
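
To illustrate the delay described in the PR description, here is a toy simulation of a priority-ordered scheduling loop. It is only a sketch under simplifying assumptions and is not Airflow's scheduler code: the function `loops_until_resumed`, its parameters, and the task names are hypothetical, and the "direct" path simply models a returning task being handed to the executor without competing in the priority ordering.

```python
import heapq


def loops_until_resumed(returning, competing, slots_per_loop, direct_to_executor=False):
    """Return the 1-based loop in which the returning task reaches the executor.

    returning: (name, priority) of the task coming back from the triggerer.
    competing: list of (name, priority) for tasks waiting to be scheduled.
    slots_per_loop: how many tasks the toy scheduler hands out per loop.
    direct_to_executor: hypothetical flag modelling the opt-in behaviour,
        where the returning task skips the priority ordering entirely.
    """
    name, priority = returning
    if direct_to_executor:
        # Assumption: the task bypasses scheduling and is queued right away.
        # It is also not counted against slots_per_loop here, which mirrors
        # the trade-off that limits may be exceeded.
        return 1

    # Max-heap by priority (negated); the index breaks ties so that
    # earlier-queued tasks win among equal priorities.
    heap = [(-p, i, n) for i, (n, p) in enumerate(competing)]
    heap.append((-priority, len(competing), name))
    heapq.heapify(heap)

    loop = 0
    while heap:
        loop += 1
        for _ in range(min(slots_per_loop, len(heap))):
            _, _, picked = heapq.heappop(heap)
            if picked == name:
                return loop
    raise RuntimeError("returning task was never scheduled")


# Ten high-priority tasks arrive while a low-priority task is deferred.
competing = [(f"high_{i}", 10) for i in range(10)]
returning = ("kpo_cleanup", 1)

delayed = loops_until_resumed(returning, competing, slots_per_loop=2)
direct = loops_until_resumed(returning, competing, slots_per_loop=2, direct_to_executor=True)
print(delayed, direct)  # 6 1
```

In this toy model the returning task waits behind every higher-priority task, while the direct path resumes it in the first loop at the cost of ignoring the per-loop slot limit.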
