GitHub user kskalski edited a discussion: Starting several EksPodOperator tasks 
clogs the worker making it slow/unresponsive to liveness probe

My setup is as follows:
* running Google Cloud Composer (composer-2.9.11-airflow-2.10.2)
* the DAG starts several (e.g. 10) `EksPodOperator` tasks at a certain hour (a rough sketch of the DAG is right below this list)
* Composer runs in a US region, while the pods are started in an AWS Asia region (so in theory there is some latency in the communication)
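
For context, the DAG is roughly shaped like the sketch below; the cluster name, image, commands and exact schedule are placeholders, not the real values:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.eks import EksPodOperator

with DAG(
    dag_id="pcap",
    schedule="48 0 * * *",   # approximate; one run per day, all tasks start together
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    for i in range(10):      # ~10 pods get created at the same moment
        EksPodOperator(
            task_id=f"parse.m{i}",
            cluster_name="my-eks-cluster",           # placeholder
            namespace="default",
            image="my-registry/pcap-parser:latest",  # placeholder
            cmds=["python", "parse.py"],             # placeholder
            aws_conn_id="aws_g",
            get_logs=True,
        )
```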

The issue I'm observing is that, once the workers start executing the pod-creation tasks, they get restarted because they fail the liveness probe in the Composer set-up (maybe this is a Composer-specific configuration; so far I have tried scaling up the environment, but the problem keeps popping up).

The relevant log lines are:
```
[2024-11-19, 01:52:42 UTC] {connection_wrapper.py:325} INFO - AWS Connection 
(conn_id='aws_g', conn_type='aws') credentials retrieved from login and 
password.
[2024-11-19, 01:52:45 UTC] {baseoperator.py:405} WARNING - 
EksPodOperator.execute cannot be called outside TaskInstance!
[2024-11-19, 01:52:45 UTC] {pod.py:1139} INFO - Building pod 
pcap-parse-ciavnb0t with labels: {'dag_id': 'pcap', 'task_id': 'parse.m0', 
'run_id': 'scheduled__2024-11-18T0048000000-1fa6cc691', 
'kubernetes_pod_operator': 'True', 'try_number': '4'}
```
after which the worker gets killed and all of its running tasks are set to failed (they manage to re-claim the running pods if there are remaining attempts, and the new attempts get to run again).

When tasks get started in a slower fashion (e.g. one after another with a delay of minutes — one way to force such a staggered start is sketched right after this list), it seems to behave more stably, so this is likely just resource exhaustion on the worker. However, I'm puzzled by how quickly it goes bad, given that I already tried:
- bumping the number of workers from 1 to 2
- adding more CPU to the worker
- switching the Composer environment to medium (in case this is related to some other ops being done, e.g. on the database)
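
For illustration, by "staggered start" I mean something along these lines (a rough sketch; the sensor timings and all names are placeholders):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.amazon.aws.operators.eks import EksPodOperator
from airflow.sensors.time_delta import TimeDeltaSensor

with DAG(
    dag_id="pcap_staggered",
    schedule="48 0 * * *",
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    for i in range(10):
        # Each pod task waits an extra 2*i minutes past the data interval end,
        # so the worker only has to launch one pod at a time.
        wait = TimeDeltaSensor(
            task_id=f"wait.m{i}",
            delta=timedelta(minutes=2 * i),
            mode="reschedule",  # frees the worker slot while waiting
        )
        parse = EksPodOperator(
            task_id=f"parse.m{i}",
            cluster_name="my-eks-cluster",           # placeholder
            namespace="default",
            image="my-registry/pcap-parser:latest",  # placeholder
            cmds=["python", "parse.py"],             # placeholder
            aws_conn_id="aws_g",
            get_logs=True,
        )
        wait >> parse
```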

It looks like starting something like >3 `EksPodOperator` tasks at the same moment on a worker leaves those tasks stuck / extremely slow and takes the whole worker out.

I'm looking for suggestions on whether:
- there is a way to limit how many tasks *start* concurrently (once tasks get past start-up, things behave mostly stably), while keeping the limit on concurrently *running* tasks high — the sketch after this list shows the kind of throttle I mean
- this is in fact just a CPU limit issue (at the time this happens the worker's CPU usage is clearly higher, but still below roughly 60% of its limit) and I should keep adding more CPU to the worker(s)
- `EksPodOperator` is doing something wrong / deadlocking (?) when several of them start on the same worker
- there are some other configuration knobs I could try
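
Regarding the first point, the only knobs I know of are the generic Airflow concurrency controls (`pool`, `max_active_tis_per_dag`), which cap concurrently *running* task instances rather than just the start-up phase, so they don't quite fit — but for illustration, this is the kind of throttle I mean (the pool name and other identifiers are placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.eks import EksPodOperator

with DAG(
    dag_id="pcap_throttled",
    schedule="48 0 * * *",
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    for i in range(10):
        EksPodOperator(
            task_id=f"parse.m{i}",
            cluster_name="my-eks-cluster",           # placeholder
            namespace="default",
            image="my-registry/pcap-parser:latest",  # placeholder
            aws_conn_id="aws_g",
            # Option 1: a small dedicated pool, e.g. created beforehand with
            #   airflow pools set eks_pod_startup 3 "throttle EKS pod tasks"
            # This caps how many of these tasks *run* at once, not just start.
            pool="eks_pod_startup",
            # Option 2: cap concurrent instances of this task across DAG runs
            # (doesn't help within one run, where the task_ids differ).
            max_active_tis_per_dag=3,
        )
```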

GitHub link: https://github.com/apache/airflow/discussions/44169
