gaom25 opened a new issue #16029:
URL: https://github.com/apache/airflow/issues/16029
**Apache Airflow version**: 2.0.1
**Kubernetes version (if you are using kubernetes)** (use `kubectl version`):
**Environment**:
- **Cloud provider or hardware configuration**: AWS EC2 t3.xlarge
- **OS** (e.g. from /etc/os-release): Debian
- **Kernel** (e.g. `uname -a`): 4.19.0-16-cloud-amd64 #1 SMP Debian 4.19.181-1 (2021-03-19) x86_64 GNU/Linux
- **Install tools**: docker standalone
- **Others**:
**What happened**:
When running a DAG with 25 dynamically generated tasks, each a `BashOperator` that executes `sleep 10`, some tasks get killed externally. The worker logs look like this:
```
May 24 17:26:51 ip-172-21-74-208 docker[32419]: [2021-05-24 17:26:51,709: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 15 (SIGTERM) Job: 18.')
May 24 17:26:51 ip-172-21-74-208 docker[32419]: Traceback (most recent call last):
May 24 17:26:51 ip-172-21-74-208 docker[32419]:   File "/usr/local/lib/python3.7/dist-packages/billiard/pool.py", line 1267, in mark_as_worker_lost
May 24 17:26:51 ip-172-21-74-208 docker[32419]:     human_status(exitcode), job._job),
May 24 17:26:51 ip-172-21-74-208 docker[32419]: billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 15 (SIGTERM) Job: 18.
```
And scheduler logs show this:
```
May 24 17:26:52 ip-172-21-74-201 docker[21979]: {"asctime": "2021-05-24 17:26:52,697", "name": "airflow.jobs.scheduler_job.SchedulerJob", "filename": "scheduler_job.py", "lineno": 1256, "levelname": "ERROR", "message": "Executor reports task instance <TaskInstance: vm_in_machine_dag.vm_in_machine_task-18 2021-05-24 17:24:20+00:00 [queued]> finished (failed) although the task says its queued. (Info: None) Was the task killed externally?"}
May 24 17:26:52 ip-172-21-74-201 docker[21979]: {"asctime": "2021-05-24 17:26:52,699", "name": "airflow.processor_manager", "filename": "dag_processing.py", "lineno": 621, "levelname": "DEBUG", "message": "Received {'full_filepath': '/opt/nonroot/airflow/dags/airflow-dag-templates--vm_in_machine/airflow-sample-dags/vm_in_machine/busy_task.py', 'msg': 'Executor reports task instance <TaskInstance: vm_in_machine_dag.vm_in_machine_task-18 2021-05-24 17:24:20+00:00 [queued]> finished (failed) although the task says its queued. (Info: None) Was the task killed externally?', 'simple_task_instance': <airflow.models.taskinstance.SimpleTaskInstance object at 0x7fea607a44a8>, 'is_failure_callback': True} signal from DagFileProcessorAgent"}
```
**What you expected to happen**:
That the DAG runs correctly without any task failures, since the machine's resource usage stays below 50% for both memory and CPU.
**How to reproduce it**:
This is the DAG code:
```
from airflow.operators.bash_operator import BashOperator
from airflow.models import DAG
import logging
from datetime import timedelta

logger = logging.getLogger('airflow.task')

default_args = {
    'depends_on_past': False,
    'start_date': '2021-01-21',
    'retries': 0,
    'retry_delay': timedelta(minutes=1),
    'email_on_failure': False,
    'email_on_retry': False,
    'owner': 'airflow'
}

dag = DAG(
    dag_id="vm_in_machine_dag",
    default_args=default_args,
    schedule_interval='@once',
    catchup=False)


def dynamic_tasks():
    tasks_count = 25
    for i in range(tasks_count):
        BashOperator(
            task_id="vm_in_machine_task-" + str(i),
            depends_on_past=False,
            bash_command="sleep 10s",
            dag=dag)


dynamic_tasks()
```
To reproduce it, launch 3 t3.xlarge EC2 instances that can reach each other, and deploy 7 Airflow installations, splitting the components across the machines: EC2 instance 1 runs all the webservers, instance 2 all the schedulers, and instance 3 all the workers.
Then launch all the DAGs at the same time; we use this bash script to unpause and trigger them all at once:
```
#!/bin/bash

dags_ids=("vm_in_machine_dag")

for i in $(docker ps -f name=APPUSE --format "{{.Names}}" | awk '{print $1}')
do
    for j in "${dags_ids[@]}"; do
        docker exec -e AA=$j $i /bin/bash -c 'airflow dags unpause $AA; airflow dags trigger $AA' &
    done
done
echo "Done"
```
We name the containers of each Airflow installation with the same prefix. For example, for Airflow installation 1 the components are:
* APPUSEA-af-ws: webserver
* APPUSEA-af-sch: scheduler
* APPUSEA-af-wkr: worker
That is why the bash script filters the containers by names starting with APPUSE.
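Just to illustrate the naming (these container names are hypothetical examples following the convention above, not a real listing from our environment), on the webserver host the filter would pick up something like:
```
$ docker ps -f name=APPUSE --format "{{.Names}}"
APPUSEA-af-ws
APPUSEB-af-ws
APPUSEC-af-ws
...
```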
After a while, around 5 minutes, some tasks end up in the failed state. They may or may not have produced a log, and sometimes the log even says that the task was marked as SUCCESS.
**Anything else we need to know**:
There is a thread in the troubleshooting Slack channel: https://apache-airflow.slack.com/archives/CCQ7EGB1P/p1620925239197200
Lastly, we have been playing with the configuration, changing the values of:
* scheduler_health_check_threshold
* scheduler_heartbeat_sec
* job_heartbeat_sec
* web_server_worker_timeout
* web_server_master_timeout
increasing the values on each run; right now we are at around 420 seconds for the health check threshold, 300 seconds for the heartbeat, and so on. The rest of the values are the defaults; a sketch of these overrides is included at the end of this report.
With this we are trying to find out how many Airflow installations a t3.xlarge EC2 instance can handle for each component before failing, so we created a basic DAG that stresses the parallelism of the machine.
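For reference, a rough sketch of the overrides mentioned above in environment-variable form (the standard `AIRFLOW__SECTION__KEY` convention). Only the 420 and 300 second values come from our actual runs; the remaining numbers are placeholders, not our exact settings:
```
# Environment-variable form of the configuration overrides listed above.
# 420 and 300 are the values mentioned in this report; the rest are placeholders.
export AIRFLOW__SCHEDULER__SCHEDULER_HEALTH_CHECK_THRESHOLD=420
export AIRFLOW__SCHEDULER__SCHEDULER_HEARTBEAT_SEC=300
export AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC=300            # placeholder
export AIRFLOW__WEBSERVER__WEB_SERVER_WORKER_TIMEOUT=300    # placeholder
export AIRFLOW__WEBSERVER__WEB_SERVER_MASTER_TIMEOUT=300    # placeholder
```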