GitHub user kskalski edited a discussion: Starting several EksPodOperator tasks
clogs the worker, making it slow/unresponsive to the liveness probe
My setup is as follows:
* running Google composer (composer-2.9.11-airflow-2.10.2)
* the DAG starts several (e.g. 10) tasks using EksPodOperator at a certain hour (roughly the shape sketched below)
* Composer runs in a US region, while the tasks are started in an AWS Asia region
(so in theory there is some extra latency in the communication)
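For reference, the DAG is roughly this shape. This is an illustrative sketch only: the cluster name, namespace, image, command and task naming are made up; the `dag_id` and `aws_conn_id` match the log excerpt below.
```python
# Rough sketch of the DAG shape only -- cluster, namespace, image, command
# and task naming are assumptions, not the real configuration.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.eks import EksPodOperator

with DAG(
    dag_id="pcap",
    schedule="48 0 * * *",        # fires at a fixed hour
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    for i in range(10):           # ~10 tasks fan out at the same moment
        EksPodOperator(
            task_id=f"parse_m{i}",
            cluster_name="my-eks-cluster",            # assumed
            aws_conn_id="aws_g",
            namespace="default",                      # assumed
            name="pcap-parse",
            image="my-registry/pcap-parser:latest",   # assumed
            cmds=["python", "parse.py"],              # assumed
            get_logs=True,
        )
```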
The issue I'm observing is that workers get restarted by the liveness probe in the
Composer setup shortly after they start executing the pod-creation tasks (maybe it
is a Composer-specific configuration; so far I have tried scaling up the
environment, but the problem keeps popping up).
The log output at that point is
```
[2024-11-19, 01:52:42 UTC] {connection_wrapper.py:325} INFO - AWS Connection
(conn_id='aws_g', conn_type='aws') credentials retrieved from login and
password.
[2024-11-19, 01:52:45 UTC] {baseoperator.py:405} WARNING -
EksPodOperator.execute cannot be called outside TaskInstance!
[2024-11-19, 01:52:45 UTC] {pod.py:1139} INFO - Building pod
pcap-parse-ciavnb0t with labels: {'dag_id': 'pcap', 'task_id': 'parse.m0',
'run_id': 'scheduled__2024-11-18T0048000000-1fa6cc691',
'kubernetes_pod_operator': 'True', 'try_number': '4'}
```
after which the worker gets killed and all running tasks are set to failed (they
manage to re-claim the running pods if there are remaining retry attempts and the
new task attempt gets to run).
When the tasks get started in a slower fashion (e.g. one after another, minutes
apart; one possible wiring is sketched after the list below), things seem to behave
more stably, so this is likely just resource exhaustion on the worker. However, I'm
puzzled by how quickly it goes bad, given that I already:
- tried bumping the number of workers from 1 to 2
- tried adding more CPU to the worker
- tried switching the Composer environment to medium size (in case this is related
to some other operations being done, e.g. on the database)
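For illustration, one way to get that slower, one-after-another start order would be to chain the tasks. A sketch only, where `make_parse_task` is a hypothetical factory returning the `EksPodOperator` from the sketch above:
```python
# Illustrative only: chaining the tasks so each one starts after the previous
# finishes avoids the start-up spike, but it also serializes the *running*,
# which is not what I want. make_parse_task() is a hypothetical factory
# returning the EksPodOperator from the sketch above.
parse_tasks = [make_parse_task(i) for i in range(10)]
for upstream, downstream in zip(parse_tasks, parse_tasks[1:]):
    upstream >> downstream
```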
It looks like starting more than about 3 `EksPodOperator` tasks at the same moment
on a worker makes those tasks stuck or extremely slow and takes the whole worker
out.
I'm looking for suggestions on whether:
- there is a way to limit the concurrent *starting* of tasks (once the tasks have
started successfully, things behave mostly stably), since I would like to keep the
limit on concurrently *running* tasks high (the closest knob I know of is sketched
after this list)
- this is in fact just a CPU limit issue (at the time this happens the worker
clearly shows higher CPU usage, but still below roughly 60% of its limit) and I
should keep adding more CPU to the worker(s)
- `EksPodOperator` is doing something wrong, or possibly deadlocking, because
several of them start on the same worker
- there are some other configuration knobs I could try
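On the first point, the closest built-in throttle I'm aware of is an Airflow pool assigned to these tasks; as far as I understand it, a pool caps concurrently *running* task instances rather than just the start-up phase, so it may not be exactly what I'm after. A minimal sketch, with a made-up pool name and slot count:
```python
# Sketch only, dropping into the DAG body from the first sketch above.
# The pool would have to exist first, e.g.:
#   airflow pools set eks_startup 3 "throttle EKS pod start-up"
# As I understand it, this caps concurrently *running* task instances,
# not just the start-up phase.
EksPodOperator(
    task_id="parse_m0",
    cluster_name="my-eks-cluster",            # assumed
    aws_conn_id="aws_g",
    namespace="default",                      # assumed
    image="my-registry/pcap-parser:latest",   # assumed
    pool="eks_startup",                       # hypothetical pool name
    pool_slots=1,                             # slots consumed per task
)
```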
GitHub link: https://github.com/apache/airflow/discussions/44169