GitHub user ywmvis edited a discussion: How to handle long-running tasks with 
the Kubernetes Operator?

Hi,

We use Kubernetes to run long-running tasks in Airflow.
The current behavior is that tasks are queued as soon as the DAG preconditions 
are met.

Since our tasks vary in duration from a few minutes to, in the worst case 
(depending on the task and data), several days, and we have only limited 
resources, tasks often fail as "stuck in queued" because Kubernetes resources 
are exhausted for a long period while the queue fills up.

For us it would be no issue if tasks simply waited in the queue until Kubernetes 
can pick them up; they should not fail just because no resources are available 
for a certain period of time.

Is there a way to handle such a scenario without causing side effects?

Stuck in queued because there are currently no resources left ---> good case: 
the task should not fail and should simply be picked up when resources become 
available again.

Stuck in queued because of a non-resource issue (e.g. Kubernetes crashed, API 
not reachable, ...) ---> the task should still fail.

It seems the "task_queued_timeout" parameter is what causes the failed tasks. 
We could increase the timeout to a very large number to prevent the tasks from 
failing, but we are not sure whether that would also prevent Airflow or the 
scheduler from detecting tasks that are genuinely stuck in the queue.
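For reference, `task_queued_timeout` lives in the `[scheduler]` section of 
`airflow.cfg` (or the matching environment variable). A minimal sketch of 
raising it, assuming Airflow 2.6+ where this option exists; the value of one 
day here is only an illustration, not a recommendation:

```ini
[scheduler]
# Seconds a task instance may remain in the "queued" state before the
# scheduler marks it as failed. Raising it keeps resource-starved tasks
# waiting instead of failing, at the cost of slower detection of tasks
# that are stuck for non-resource reasons (e.g. an unreachable API).
task_queued_timeout = 86400
```

The same setting can be supplied as the environment variable 
`AIRFLOW__SCHEDULER__TASK_QUEUED_TIMEOUT=86400`. Note the trade-off described 
above still applies: this timeout cannot distinguish "waiting for resources" 
from "stuck for another reason", so raising it delays detection of both.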

Any recommendation on how we could work around this issue?

GitHub link: https://github.com/apache/airflow/discussions/45503
