eduardchai edited a comment on issue #18041:
URL: https://github.com/apache/airflow/issues/18041#issuecomment-1014385675


   We started seeing this issue after upgrading to v2.2.3. We did not experience it on v2.0.2.
   
   Here is the sample DAG that we used:
   
   ```
   from datetime import datetime
   
   from airflow import DAG
   from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
       KubernetesPodOperator,
   )
   
   with DAG(
       dag_id="stress-test-kubepodoperator",
       schedule_interval=None,
       catchup=False,
       start_date=datetime(2021, 1, 1),
   ) as dag:
       # Create 1000 independent tasks; each pod sleeps for a random 1-120 seconds.
       for i in range(1000):
           KubernetesPodOperator(
               name="airflow-test-pod",
               namespace="airflow-official",
               image="ubuntu:latest",
               cmds=["bash", "-cx"],
               arguments=["r=$(( ( RANDOM % 120 )  + 1 )); sleep ${r}s"],
               labels={"foo": "bar"},
               task_id="task_" + str(i),
               is_delete_operator_pod=True,
               startup_timeout_seconds=300,
               service_account_name="airflow-worker",
               get_logs=True,
               resources={"request_memory": "128Mi", "request_cpu": "100m"},
               queue="kubernetes",
           )
   ```
   
   Error message:
   ```
   [2022-01-17, 18:41:48 +08] {local_task_job.py:212} WARNING - State of this instance has been externally set to scheduled. Terminating instance.
   [2022-01-17, 18:41:48 +08] {process_utils.py:124} INFO - Sending Signals.SIGTERM to group 16. PIDs of all processes in the group: [16]
   [2022-01-17, 18:41:48 +08] {process_utils.py:75} INFO - Sending the signal Signals.SIGTERM to group 16
   [2022-01-17, 18:41:48 +08] {taskinstance.py:1408} ERROR - Received SIGTERM. Terminating subprocesses.
   [2022-01-17, 18:41:48 +08] {taskinstance.py:1700} ERROR - Task failed with exception
   ```
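
For anyone triaging similar task logs, the following is a minimal sketch (not part of Airflow) that scans log lines of the shape shown above for the "externally set to" warning. The names `LOG_LINE`, `LogRecord`, `parse_line`, and `find_external_state_changes` are hypothetical helpers introduced here for illustration, and the regular expression assumes each log record sits on a single line in the `[timestamp] {file:line} LEVEL - message` layout.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Assumed line layout, taken from the log excerpt in this comment:
# [2022-01-17, 18:41:48 +08] {local_task_job.py:212} WARNING - message
LOG_LINE = re.compile(
    r"^\s*\[(?P<timestamp>[^\]]+)\]\s+"
    r"\{(?P<source>[^}:]+):(?P<lineno>\d+)\}\s+"
    r"(?P<level>[A-Z]+)\s+-\s+(?P<message>.*)$"
)

# Substring of the warning emitted before the task receives SIGTERM.
EXTERNAL_STATE_MARKER = "externally set to"


class LogRecord(NamedTuple):
    timestamp: str
    source: str
    lineno: int
    level: str
    message: str


def parse_line(line: str) -> Optional[LogRecord]:
    """Parse one log line; return None if it does not match the layout."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    return LogRecord(
        timestamp=match["timestamp"],
        source=match["source"],
        lineno=int(match["lineno"]),
        level=match["level"],
        message=match["message"],
    )


def find_external_state_changes(lines: Iterable[str]) -> List[LogRecord]:
    """Return the records whose message reports an external state change."""
    records = (parse_line(line) for line in lines)
    return [
        rec for rec in records
        if rec is not None and EXTERNAL_STATE_MARKER in rec.message
    ]
```

Passing the lines of a task log file to `find_external_state_changes` returns only the records that report an external state change, which may help count how many of the 1000 tasks were affected.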
   
   Successful tasks were also randomly flagged as failed:
   ```
   [2022-01-17, 18:41:20 +08] {kubernetes_pod.py:372} INFO - creating pod with labels {'dag_id': 'stress-test-kubepodoperator', 'task_id': 'task_44', 'execution_date': '2022-01-17T103621.2231870000-7740efddd', 'try_number': '1'} and launcher <airflow.providers.cncf.kubernetes.utils.pod_launcher.PodLauncher object at 0x7f569d311f90>
   [2022-01-17, 18:41:20 +08] {pod_launcher.py:216} INFO - Event: airflow-test-pod.c309a1eaf221470a882edf2cb57f9529 had an event of type Pending
   [2022-01-17, 18:41:20 +08] {pod_launcher.py:133} WARNING - Pod not yet started: airflow-test-pod.c309a1eaf221470a882edf2cb57f9529
   [2022-01-17, 18:41:21 +08] {pod_launcher.py:216} INFO - Event: airflow-test-pod.c309a1eaf221470a882edf2cb57f9529 had an event of type Pending
   [2022-01-17, 18:41:21 +08] {pod_launcher.py:133} WARNING - Pod not yet started: airflow-test-pod.c309a1eaf221470a882edf2cb57f9529
   [2022-01-17, 18:41:22 +08] {pod_launcher.py:216} INFO - Event: airflow-test-pod.c309a1eaf221470a882edf2cb57f9529 had an event of type Pending
   [2022-01-17, 18:41:22 +08] {pod_launcher.py:133} WARNING - Pod not yet started: airflow-test-pod.c309a1eaf221470a882edf2cb57f9529
   [2022-01-17, 18:41:23 +08] {pod_launcher.py:216} INFO - Event: airflow-test-pod.c309a1eaf221470a882edf2cb57f9529 had an event of type Running
   [2022-01-17, 18:41:23 +08] {pod_launcher.py:159} INFO - + r=62
   [2022-01-17, 18:41:23 +08] {pod_launcher.py:159} INFO - + sleep 62s
   [2022-01-17, 18:42:25 +08] {pod_launcher.py:216} INFO - Event: airflow-test-pod.c309a1eaf221470a882edf2cb57f9529 had an event of type Succeeded
   [2022-01-17, 18:42:25 +08] {pod_launcher.py:333} INFO - Event with job id airflow-test-pod.c309a1eaf221470a882edf2cb57f9529 Succeeded
   [2022-01-17, 18:42:25 +08] {pod_launcher.py:216} INFO - Event: airflow-test-pod.c309a1eaf221470a882edf2cb57f9529 had an event of type Succeeded
   [2022-01-17, 18:42:25 +08] {pod_launcher.py:333} INFO - Event with job id airflow-test-pod.c309a1eaf221470a882edf2cb57f9529 Succeeded
   [2022-01-17, 18:42:25 +08] {taskinstance.py:1277} INFO - Marking task as SUCCESS. dag_id=stress-test-kubepodoperator, task_id=task_44, execution_date=20220117T103621, start_date=20220117T104119, end_date=20220117T104225
   [2022-01-17, 18:42:25 +08] {local_task_job.py:154} INFO - Task exited with return code 0
   [2022-01-17, 18:42:25 +08] {local_task_job.py:264} INFO - 0 downstream tasks scheduled from follow-on schedule check
   ```
   
   Environment information:
   - Deployed using the official Helm chart
   - Airflow version: v2.2.3
   - Number of scheduler replicas: 3 (each requesting 3 CPU and 6Gi memory)



