gaom25 opened a new issue #16029:
URL: https://github.com/apache/airflow/issues/16029
**Apache Airflow version**: 2.0.1
**Kubernetes version (if you are using kubernetes)** (use `kubectl version`):
**Environment**:
- **Cloud provider or hardware configuration**: AWS EC2 t3.xlarge
- **OS** (e.g. from /etc/os-release): Debian
- **Kernel** (e.g. `uname -a`): 4.19.0-16-cloud-amd64 #1 SMP Debian 4.19.181-1 (2021-03-19) x86_64 GNU/Linux
- **Install tools**: docker standalone
- **Others**:
**What happened**:
When running a DAG with 25 dynamically generated tasks, each a `BashOperator` that executes `sleep 10`, some tasks get killed externally. The worker logs look like this:
```
May 24 17:26:51 ip-172-21-74-208 docker[32419]: [2021-05-24 17:26:51,709: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 15 (SIGTERM) Job: 18.')
May 24 17:26:51 ip-172-21-74-208 docker[32419]: Traceback (most recent call last):
May 24 17:26:51 ip-172-21-74-208 docker[32419]:   File "/usr/local/lib/python3.7/dist-packages/billiard/pool.py", line 1267, in mark_as_worker_lost
May 24 17:26:51 ip-172-21-74-208 docker[32419]:     human_status(exitcode), job._job),
May 24 17:26:51 ip-172-21-74-208 docker[32419]: billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 15 (SIGTERM) Job: 18.
```
And scheduler logs show this:
```
May 24 17:26:52 ip-172-21-74-201 docker[21979]: {"asctime": "2021-05-24 17:26:52,697", "name": "airflow.jobs.scheduler_job.SchedulerJob", "filename": "scheduler_job.py", "lineno": 1256, "levelname": "ERROR", "message": "Executor reports task instance <TaskInstance: vm_in_machine_dag.vm_in_machine_task-18 2021-05-24 17:24:20+00:00 [queued]> finished (failed) although the task says its queued. (Info: None) Was the task killed externally?"}
May 24 17:26:52 ip-172-21-74-201 docker[21979]: {"asctime": "2021-05-24 17:26:52,699", "name": "airflow.processor_manager", "filename": "dag_processing.py", "lineno": 621, "levelname": "DEBUG", "message": "Received {'full_filepath': '/opt/nonroot/airflow/dags/airflow-dag-templates--vm_in_machine/airflow-sample-dags/vm_in_machine/busy_task.py', 'msg': 'Executor reports task instance <TaskInstance: vm_in_machine_dag.vm_in_machine_task-18 2021-05-24 17:24:20+00:00 [queued]> finished (failed) although the task says its queued. (Info: None) Was the task killed externally?', 'simple_task_instance': <airflow.models.taskinstance.SimpleTaskInstance object at 0x7fea607a44a8>, 'is_failure_callback': True} signal from DagFileProcessorAgent"}
```
**What you expected to happen**:
That the DAG runs correctly without any task failures, since the machine's resource usage stays below 50% for both memory and CPU.
**How to reproduce it**:
This is the DAG code:
```
from airflow.operators.bash_operator import BashOperator
from airflow.models import DAG
import logging
from datetime import timedelta

logger = logging.getLogger('airflow.task')

default_args = {
    'depends_on_past': False,
    'start_date': '2021-01-21',
    'retries': 0,
    'retry_delay': timedelta(minutes=1),
    'email_on_failure': False,
    'email_on_retry': False,
    'owner': 'airflow'
}

dag = DAG(
    dag_id="vm_in_machine_dag",
    default_args=default_args,
    schedule_interval='@once',
    catchup=False)


def dynamic_tasks():
    tasks_count = 25
    for i in range(tasks_count):
        BashOperator(
            task_id="vm_in_machine_task-" + str(i),
            depends_on_past=False,
            bash_command="sleep 10s",
            dag=dag)


dynamic_tasks()
```
To reproduce it, launch 3 t3.xlarge EC2 instances that can reach each other, and deploy 7 Airflow installations, splitting the components across the machines: EC2 instance 1 runs all the webservers, instance 2 all the schedulers, and instance 3 all the workers.
Then launch all the DAGs at the same time; we use this bash script to unpause and trigger them all at once:
```
#!/bin/bash

dags_ids=("vm_in_machine_dag")

for i in $(docker ps -f name=APPUSE --format "{{.Names}}" | awk '{print $1}')
do
    for j in "${dags_ids[@]}"; do
        docker exec -e AA=$j $i /bin/bash -c 'airflow dags unpause $AA; airflow dags trigger $AA' &
    done
done
echo "Done"
```
We name the containers of each Airflow installation with the same prefix. For example, for Airflow installation 1 the components are:
* APPUSEA-af-ws: webserver
* APPUSEA-af-sch: scheduler
* APPUSEA-af-wkr: worker
That is why the bash script filters the containers by names starting with APPUSE.
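Just to illustrate the naming (these container names are hypothetical examples following the convention above, not a real listing from our environment), on the webserver host the filter would pick up something like:
```
$ docker ps -f name=APPUSE --format "{{.Names}}"
APPUSEA-af-ws
APPUSEB-af-ws
APPUSEC-af-ws
...
```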
After a while, around 5 minutes, some tasks end up in the failed state. They may or may not have produced a log, and sometimes the log even says that the task was marked as SUCCESS.
**Anything else we need to know**:
There is a thread in the troubleshooting Slack channel: https://apache-airflow.slack.com/archives/CCQ7EGB1P/p1620925239197200
Lastly, we have been playing with the configuration, changing the values of:
* scheduler_health_check_threshold
* scheduler_heartbeat_sec
* job_heartbeat_sec
* web_server_worker_timeout
* web_server_master_timeout
increasing the values on each run; right now we are at around 420 seconds for the health check threshold, 300 seconds for the heartbeat, and so on. The rest of the values are the defaults; a sketch of these overrides is included at the end of this report.
With this we are trying to find out how many Airflow installations a t3.xlarge EC2 instance can handle for each component before failing, so we created a basic DAG that stresses the parallelism of the machine.
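For reference, a rough sketch of the overrides mentioned above in environment-variable form (the standard `AIRFLOW__SECTION__KEY` convention). Only the 420 and 300 second values come from our actual runs; the remaining numbers are placeholders, not our exact settings:
```
# Environment-variable form of the configuration overrides listed above.
# 420 and 300 are the values mentioned in this report; the rest are placeholders.
export AIRFLOW__SCHEDULER__SCHEDULER_HEALTH_CHECK_THRESHOLD=420
export AIRFLOW__SCHEDULER__SCHEDULER_HEARTBEAT_SEC=300
export AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC=300            # placeholder
export AIRFLOW__WEBSERVER__WEB_SERVER_WORKER_TIMEOUT=300    # placeholder
export AIRFLOW__WEBSERVER__WEB_SERVER_MASTER_TIMEOUT=300    # placeholder
```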