GitHub user ywmvis edited a discussion: How to handle long running tasks with the Kubernetes Operator?
Hi, we use Kubernetes to run long-running tasks in Airflow. The current behavior is that tasks are queued as soon as the DAG preconditions are met. Our tasks vary in duration from a few minutes to, in the worst case (depending on the task and data), several days, and we have only limited resources, so tasks often fail as "stuck in queued" because Kubernetes resources are exhausted for a long period while the queue keeps filling up.

For us it would be no problem if tasks simply waited in the queue until Kubernetes can pick them up; they should not fail just because no resources are available for some period of time. Is there a way to handle such a scenario without causing side effects?

- Stuck in queued because there are currently no resources left ---> good case, the task should not fail and should just be picked up when resources become available again.
- Stuck in queued for non-resource reasons (e.g. Kubernetes crashed, API not reachable, ...) ---> the task should still fail.

It seems the `task_queued_timeout` parameter is what causes the failed tasks. We could increase the timeout to a really big number to prevent the tasks from failing, but we are not sure whether we would then prevent Airflow or the scheduler from detecting tasks that really are stuck in the queue.

Any recommendation on how we could work around this issue?

GitHub link: https://github.com/apache/airflow/discussions/45503
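For reference, in recent Airflow versions the `task_queued_timeout` option discussed above lives in the `[scheduler]` section of `airflow.cfg`. The sketch below only illustrates where the knob sits; the `86400.0` value (one day) is an arbitrary example, not a recommended setting:

```ini
[scheduler]
# Number of seconds a task may remain in the "queued" state before the
# scheduler marks it as failed. The default is 600.0; the example value
# below is just an illustration of allowing long waits on cluster capacity.
task_queued_timeout = 86400.0
```

The same option can be set via the environment variable `AIRFLOW__SCHEDULER__TASK_QUEUED_TIMEOUT`. Note that raising it affects all queued tasks, so it cannot by itself distinguish the "no resources" case from the "Kubernetes is broken" case described above.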
