eduardchai edited a comment on issue #18041:
URL: https://github.com/apache/airflow/issues/18041#issuecomment-1014385675
We started seeing this issue after upgrading to v2.2.3. We did not
experience it on v2.0.2.
Here is the sample DAG that we used:
```
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="stress-test-kubepodoperator",
    schedule_interval=None,
    catchup=False,
    start_date=datetime(2021, 1, 1),
) as dag:
    # Create 1000 independent pod tasks, each sleeping 1-120 seconds.
    for i in range(1000):
        KubernetesPodOperator(
            name="airflow-test-pod",
            namespace="airflow-official",
            image="ubuntu:latest",
            cmds=["bash", "-cx"],
            arguments=["r=$(( ( RANDOM % 120 ) + 1 )); sleep ${r}s"],
            labels={"foo": "bar"},
            task_id="task_" + str(i),
            is_delete_operator_pod=True,
            startup_timeout_seconds=300,
            service_account_name="airflow-worker",
            get_logs=True,
            resources={"request_memory": "128Mi", "request_cpu": "100m"},
            queue="kubernetes",
        )
```
Error message:
```
[2022-01-17, 18:41:48 +08] {local_task_job.py:212} WARNING - State of this
instance has been externally set to scheduled. Terminating instance.
[2022-01-17, 18:41:48 +08] {process_utils.py:124} INFO - Sending
Signals.SIGTERM to group 16. PIDs of all processes in the group: [16]
[2022-01-17, 18:41:48 +08] {process_utils.py:75} INFO - Sending the signal
Signals.SIGTERM to group 16
[2022-01-17, 18:41:48 +08] {taskinstance.py:1408} ERROR - Received SIGTERM.
Terminating subprocesses.
[2022-01-17, 18:41:48 +08] {taskinstance.py:1700} ERROR - Task failed with
exception
```
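For anyone trying to count how many task attempts hit this, a rough sketch of scanning task log text for the warning above might look like the following. The function name `find_external_state_changes` and the `sample` string are made up for illustration, and the pattern is based only on the wording of the log excerpt in this comment, so it may need adjusting for other Airflow versions.

```python
import re
from typing import List

# Pattern taken from the wording of the warning in the excerpt above.
# \s+ tolerates the line wrap between "this" and "instance".
EXTERNAL_STATE_RE = re.compile(
    r"State of this\s+instance has been externally set to (\w+)"
)


def find_external_state_changes(log_text: str) -> List[str]:
    """Return every state the task instance was reported as externally set to."""
    return EXTERNAL_STATE_RE.findall(log_text)


# Hypothetical sample copied from the error message above.
sample = (
    "[2022-01-17, 18:41:48 +08] {local_task_job.py:212} WARNING - State of this\n"
    "instance has been externally set to scheduled. Terminating instance.\n"
)
print(find_external_state_changes(sample))
```

Running this over the log of each failed try should show whether the failures all share the same "externally set to scheduled" cause.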
Successful tasks were also intermittently flagged as failed:
<img width="1364" alt="Screenshot 2022-01-17 at 7 16 51 PM"
src="https://user-images.githubusercontent.com/16315447/149760309-a8c6d993-a32d-4bb0-990d-f9d8f76a259c.png">
```
[2022-01-17, 18:41:20 +08] {kubernetes_pod.py:372} INFO - creating pod with
labels {'dag_id': 'stress-test-kubepodoperator', 'task_id': 'task_44',
'execution_date': '2022-01-17T103621.2231870000-7740efddd', 'try_number': '1'}
and launcher <airflow.providers.cncf.kubernetes.utils.pod_launcher.PodLauncher
object at 0x7f569d311f90>
[2022-01-17, 18:41:20 +08] {pod_launcher.py:216} INFO - Event:
airflow-test-pod.c309a1eaf221470a882edf2cb57f9529 had an event of type Pending
[2022-01-17, 18:41:20 +08] {pod_launcher.py:133} WARNING - Pod not yet
started: airflow-test-pod.c309a1eaf221470a882edf2cb57f9529
[2022-01-17, 18:41:21 +08] {pod_launcher.py:216} INFO - Event:
airflow-test-pod.c309a1eaf221470a882edf2cb57f9529 had an event of type Pending
[2022-01-17, 18:41:21 +08] {pod_launcher.py:133} WARNING - Pod not yet
started: airflow-test-pod.c309a1eaf221470a882edf2cb57f9529
[2022-01-17, 18:41:22 +08] {pod_launcher.py:216} INFO - Event:
airflow-test-pod.c309a1eaf221470a882edf2cb57f9529 had an event of type Pending
[2022-01-17, 18:41:22 +08] {pod_launcher.py:133} WARNING - Pod not yet
started: airflow-test-pod.c309a1eaf221470a882edf2cb57f9529
[2022-01-17, 18:41:23 +08] {pod_launcher.py:216} INFO - Event:
airflow-test-pod.c309a1eaf221470a882edf2cb57f9529 had an event of type Running
[2022-01-17, 18:41:23 +08] {pod_launcher.py:159} INFO - + r=62
[2022-01-17, 18:41:23 +08] {pod_launcher.py:159} INFO - + sleep 62s
[2022-01-17, 18:42:25 +08] {pod_launcher.py:216} INFO - Event:
airflow-test-pod.c309a1eaf221470a882edf2cb57f9529 had an event of type Succeeded
[2022-01-17, 18:42:25 +08] {pod_launcher.py:333} INFO - Event with job id
airflow-test-pod.c309a1eaf221470a882edf2cb57f9529 Succeeded
[2022-01-17, 18:42:25 +08] {pod_launcher.py:216} INFO - Event:
airflow-test-pod.c309a1eaf221470a882edf2cb57f9529 had an event of type Succeeded
[2022-01-17, 18:42:25 +08] {pod_launcher.py:333} INFO - Event with job id
airflow-test-pod.c309a1eaf221470a882edf2cb57f9529 Succeeded
[2022-01-17, 18:42:25 +08] {taskinstance.py:1277} INFO - Marking task as
SUCCESS. dag_id=stress-test-kubepodoperator, task_id=task_44,
execution_date=20220117T103621, start_date=20220117T104119,
end_date=20220117T104225
[2022-01-17, 18:42:25 +08] {local_task_job.py:154} INFO - Task exited with
return code 0
[2022-01-17, 18:42:25 +08] {local_task_job.py:264} INFO - 0 downstream tasks
scheduled from follow-on schedule check
```
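To cross-check such cases, one option is to pull the tasks that the logs report as successful and compare them with the states shown in the UI. The sketch below only covers the log side; the name `find_success_marks` and the `sample` string are hypothetical, and the pattern assumes the "Marking task as SUCCESS" wording seen in the log above.

```python
import re
from typing import List, Tuple

# Pattern based on the "Marking task as SUCCESS" line in the log above.
# \s+ tolerates line wraps inside the message.
SUCCESS_RE = re.compile(
    r"Marking task as\s+SUCCESS\.\s+dag_id=([\w.\-]+),\s+task_id=([\w.\-]+)"
)


def find_success_marks(log_text: str) -> List[Tuple[str, str]]:
    """Return (dag_id, task_id) pairs that the log reports as marked SUCCESS."""
    return SUCCESS_RE.findall(log_text)


# Hypothetical sample copied from the log above.
sample = (
    "[2022-01-17, 18:42:25 +08] {taskinstance.py:1277} INFO - Marking task as\n"
    "SUCCESS. dag_id=stress-test-kubepodoperator, task_id=task_44,\n"
    "execution_date=20220117T103621, start_date=20220117T104119,\n"
    "end_date=20220117T104225\n"
)
print(find_success_marks(sample))
```

Any pair returned here whose task instance shows as failed in the UI would be an instance of the mismatch described above.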
Environment information:
- Deployed using the official Helm charts
- Airflow version: v2.2.3
- No. of scheduler replicas: 3 (each with 3 CPU and 6Gi memory request)