jscheffl opened a new pull request, #63489:
URL: https://github.com/apache/airflow/pull/63489

   As we have many Dags with different priorities and use a lot of Deferred 
Tasks, we noticed that tasks returning from the triggerer to a worker are 
sometimes stuck for a long time. This can happen if a Dag with higher priority 
is started after a task has been deferred: when the deferred task returns, the 
other tasks are scheduled ahead of it.
   
   This is bad, for example, in our case where we use KPO 
(KubernetesPodOperator): the Pods are completed but still allocate disk space, 
and they might get lost if the Autoscaler decides to turn down a node. In that 
case the XCom information is lost as well, because the XCom side-car stays 
active, waiting for the task to become active again on the worker so that it 
can fetch the XCom and delete the Pod.
   
   We considered the following options:
   - Accept the situation... no, this has been bothering us for a long time: we 
lose a lot of XCom results and users sometimes wait a long time for results. 
We also had situations where newer high priority workload could not be 
scheduled because nodes were still allocated with Pod leftovers waiting for 
deletion, and the deletion was not happening because of the lower priority.
   - Implement a "hack" and increase the priority of tasks returning from the 
Triggerer. But then, if the task actually failed, the priority would need to be 
reset again.
   - Ensure that tasks returning from the Triggerer are not scheduled again 
(with the risk of being delayed) but are pushed directly to the executor queue 
--> this PR!
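
To make the difference between the options concrete, here is a minimal toy model in Python. It is not Airflow's scheduler code: the task dictionaries, the `returning` flag, and both function names are hypothetical and exist only for illustration. It shows how a low priority task coming back from the triggerer can be starved by strict priority ordering, and how bypassing that ordering lets it run first.

```python
def schedule_by_priority(tasks, slots):
    """Toy scheduler: pick up to `slots` tasks, highest priority first."""
    ordered = sorted(tasks, key=lambda t: -t["priority"])
    return [t["name"] for t in ordered[:slots]]


def schedule_with_direct_requeue(tasks, slots):
    """Toy variant: tasks returning from the triggerer skip the priority
    ordering and are placed in front of everything else."""
    returning = [t for t in tasks if t["returning"]]
    waiting = sorted(
        (t for t in tasks if not t["returning"]),
        key=lambda t: -t["priority"],
    )
    return [t["name"] for t in (returning + waiting)[:slots]]


# Hypothetical workload: one low priority task came back from the triggerer,
# while two high priority tasks were started in the meantime.
tasks = [
    {"name": "low_kpo_cleanup", "priority": 1, "returning": True},
    {"name": "high_a", "priority": 10, "returning": False},
    {"name": "high_b", "priority": 10, "returning": False},
]

starved = schedule_by_priority(tasks, slots=2)
direct = schedule_with_direct_requeue(tasks, slots=2)
```

With two free slots, the priority-only ordering leaves `low_kpo_cleanup` waiting, whereas the direct variant runs it before the new high priority work.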
   
   I'd like to propose adding this as a new feature in 3.2.0. We have patched 
this locally into 3.1.7 and the results show that (1) the latency for cleanup 
in Deferred KPO is very much improved and (2) the effort spent on scheduling is 
reduced.
   
   The trade-off is that there might be situations where more tasks than 
allowed are "running" if deferred tasks are not counted in pools. Therefore 
the config is not enabled by default and is opt-in.
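
The trade-off can be illustrated with a small arithmetic sketch. This is a simplified model under stated assumptions, not Airflow's pool accounting: the function `simulate_running` and its parameters are hypothetical, and it assumes that a returning task goes straight to the executor without a fresh pool check.

```python
def simulate_running(pool_slots, deferred, count_deferred):
    """Return how many tasks end up running in this toy model.

    pool_slots:     size of the pool
    deferred:       tasks currently waiting on the triggerer
    count_deferred: whether deferred tasks occupy pool slots
    """
    # While tasks are deferred, new tasks fill whatever the pool reports free.
    occupied = deferred if count_deferred else 0
    running = max(pool_slots - occupied, 0)
    # The deferred tasks return and are pushed directly to the executor,
    # skipping the pool check (assumption of this sketch).
    running += deferred
    return running


over = simulate_running(pool_slots=2, deferred=1, count_deferred=False)
within = simulate_running(pool_slots=2, deferred=1, count_deferred=True)
```

In this model a pool of 2 ends up with 3 running tasks when deferred tasks are not counted, and stays at 2 when they are.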
   
   FYI @AutomationDev85 @dabla 
   
   ---
   
   ##### Was generative AI tooling used to co-author this PR?
   
   - [ ] Yes (please specify the tool below)
   Claude 4.6
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
