jedcunningham commented on code in PR #30375:
URL: https://github.com/apache/airflow/pull/30375#discussion_r1166052103
##########
newsfragments/30375.significant.rst:
##########
@@ -0,0 +1,9 @@
+Simplify logic to resolve tasks stuck in queued despite stalled_task_timeout
Review Comment:
```suggestion
Consolidate handling of tasks stuck in queued under new ``task_queued_timeout`` config
```
This is probably a better title?
##########
newsfragments/30375.significant.rst:
##########
@@ -0,0 +1,9 @@
+Simplify logic to resolve tasks stuck in queued despite stalled_task_timeout
+
+Logic for handling tasks stuck in the queued state has been consolidated, and all configurations
+responsible for timing out stuck queued tasks have been deprecated and merged into
+`scheduler.task_queued_timeout`. The configurations that have been deprecated are
+`kubernetes.worker_pods_pending_timeout`, `celery.stalled_task_timeout`, and
+`celery.task_adoption_timeout`. If any of these configurations are set, the longest timeout will be
+respected. For example, if `celery.stalled_task_timeout` is 1200, and `scheduler.task_queued_timeout`
+is 600, Airflow will set `scheduler.task_queued_timeout` to 1200.
Review Comment:
```suggestion
Logic for handling tasks stuck in the queued state has been consolidated, and all configurations
responsible for timing out stuck queued tasks have been deprecated and merged into
``scheduler.task_queued_timeout``. The configurations that have been deprecated are
``kubernetes.worker_pods_pending_timeout``, ``celery.stalled_task_timeout``, and
``celery.task_adoption_timeout``. If any of these configurations are set, the longest timeout will be
respected. For example, if ``celery.stalled_task_timeout`` is 1200, and ``scheduler.task_queued_timeout``
is 600, Airflow will set ``scheduler.task_queued_timeout`` to 1200.
```
Fixes static checks.
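For illustration, here is a rough Python sketch (not the actual Airflow code) of the "longest timeout wins" resolution described in the newsfragment; the stalled_task_timeout and task_queued_timeout values come from the example above, while the values for the other two deprecated options are assumed:
```python
# Rough illustration only: how the deprecated timeouts and the new
# scheduler.task_queued_timeout could resolve to one effective value.
deprecated_timeouts = {
    "kubernetes.worker_pods_pending_timeout": 600.0,  # assumed value
    "celery.stalled_task_timeout": 1200.0,            # from the example
    "celery.task_adoption_timeout": 600.0,            # assumed value
}
task_queued_timeout = 600.0  # [scheduler] task_queued_timeout, from the example

# If any deprecated option is set, the longest timeout is respected.
effective_timeout = max(task_queued_timeout, *deprecated_timeouts.values())
print(effective_timeout)  # 1200.0
```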
##########
newsfragments/30375.significant.rst:
##########
@@ -0,0 +1,9 @@
+Simplify logic to resolve tasks stuck in queued despite stalled_task_timeout
+
+Logic for handling tasks stuck in the queued state has been consolidated, and all configurations
+responsible for timing out stuck queued tasks have been deprecated and merged into
+`scheduler.task_queued_timeout`. The configurations that have been deprecated are
Review Comment:
```suggestion
`[scheduler] task_queued_timeout`. The configurations that have been deprecated are
```
Let's reference these config options like this instead.
##########
airflow/executors/base_executor.py:
##########
@@ -376,6 +376,18 @@ def terminate(self):
"""This method is called when the daemon receives a SIGTERM."""
raise NotImplementedError()
+    def cleanup_stuck_queued_tasks(self, tis: list[TaskInstance]) -> list[str]:  # pragma: no cover
+        """
+        Handle remnants of tasks that were failed because they were stuck in queued.
+        Tasks can get stuck in queued. If such a task is detected, it will be marked
+        as `UP_FOR_RETRY` if the task instance has remaining retries or marked as `FAILED`
+        if it doesn't.
+
+        :param tis: List of Task Instances to clean up
+        :return: List of readable task instances for a warning message
+        """
+        raise NotImplementedError()
Review Comment:
I just realized we haven't properly handled what happens when another executor hits this (however unlikely).

An easy way to reproduce this would be setting a really low task_queued_timeout and adding a longer sleep to the top level of a DAG, when using LocalExecutor.

Maybe this should just log and call fail also? It shouldn't kill the scheduler, I'd think.
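For context, a hypothetical sketch of the reproduction described above (the DAG id, the 60-second sleep, and the operator are illustrative, not from the PR): with LocalExecutor and a very low task_queued_timeout, the top-level sleep keeps the task in queued past the timeout, so the scheduler reaches the cleanup path on an executor that doesn't implement it.
```python
# Hypothetical reproduction sketch (names and durations are made up):
# with LocalExecutor and a very low [scheduler] task_queued_timeout, the
# top-level sleep delays DAG file parsing in the worker, so the task can
# sit in QUEUED long enough for the stuck-in-queued cleanup to fire.
import time
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

time.sleep(60)  # top-level sleep, runs every time the DAG file is parsed

with DAG(
    dag_id="stuck_in_queued_repro",
    start_date=datetime(2023, 1, 1),
    schedule=None,
):
    EmptyOperator(task_id="noop")
```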