bipin2295 opened a new issue #22528:
URL: https://github.com/apache/airflow/issues/22528
### Apache Airflow version
2.2.3
### What happened
We are running Airflow with the KubernetesExecutor. While a task was executing,
the Airflow worker pod appears to have been restarted or terminated, which
caused the running task to be marked as failed with a SIGTERM error.
Below is the task log from Airflow:
```
[2022-03-25, 19:09:45 IST] {local_task_job.py:82} ERROR - Received SIGTERM.
Terminating subprocesses
[2022-03-25, 19:09:45 IST] {process_utils.py:120} INFO - Sending
Signals.SIGTERM to group 121. PIDs of all processes in the group: [122, 121]
[2022-03-25, 19:09:45 IST] {process_utils.py:75} INFO - Sending the signal
Signals.SIGTERM to group 121
[2022-03-25, 19:09:45 IST] {taskinstance.py:1408} ERROR - Received SIGTERM.
Terminating subprocesses.
[2022-03-25, 19:09:45 IST] {spark_submit.py:623} INFO - Sending kill signal
to spark-submit
[2022-03-25, 19:09:45 IST] {taskinstance.py:1700} ERROR - Task failed with
exception
Traceback (most recent call last):
File
"/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py",
line 1329, in _run_raw_task
self._execute_task_with_callbacks(context)
File
"/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py",
line 1455, in _execute_task_with_callbacks
result = self._execute_task(context, self.task)
File
"/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py",
line 1511, in _execute_task
result = execute_callable(context=context)
File
"/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/apache/spark/operators/spark_submit.py",
line 157, in execute
self._hook.submit(self._application)
File
"/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/apache/spark/hooks/spark_submit.py",
line 407, in submit
self._process_spark_submit_log(iter(self._submit_sp.stdout)) # type:
ignore
File
"/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/apache/spark/hooks/spark_submit.py",
line 456, in _process_spark_submit_log
for line in itr:
File
"/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py",
line 1410, in signal_handler
raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal
[2022-03-25, 19:09:45 IST] {taskinstance.py:1267} INFO - Marking task as
FAILED. dag_id=kda_create_model_alpha, task_id=create_model,
execution_date=20220325T124433, start_date=20220325T124451,
end_date=20220325T133945
[2022-03-25, 19:09:46 IST] {standard_task_runner.py:89} ERROR - Failed to
execute job 2451 for task create_model
Traceback (most recent call last):
File
"/home/airflow/.local/lib/python3.8/site-packages/airflow/task/task_runner/standard_task_runner.py",
line 85, in _start_by_fork
args.func(args, dag=self.dag)
File
"/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/cli_parser.py",
line 48, in command
return func(*args, **kwargs)
File
"/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/cli.py", line
92, in wrapper
return f(*args, **kwargs)
File
"/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/commands/task_command.py",
line 298, in task_run
_run_task_by_selected_method(args, dag, ti)
File
"/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/commands/task_command.py",
line 107, in _run_task_by_selected_method
_run_raw_task(args, ti)
File
"/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/commands/task_command.py",
line 180, in _run_raw_task
ti._run_raw_task(
File
"/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/session.py",
line 70, in wrapper
return func(*args, session=session, **kwargs)
File
"/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py",
line 1329, in _run_raw_task
self._execute_task_with_callbacks(context)
File
"/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py",
line 1455, in _execute_task_with_callbacks
result = self._execute_task(context, self.task)
File
"/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py",
line 1511, in _execute_task
result = execute_callable(context=context)
File
"/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/apache/spark/operators/spark_submit.py",
line 157, in execute
self._hook.submit(self._application)
File
"/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/apache/spark/hooks/spark_submit.py",
line 407, in submit
self._process_spark_submit_log(iter(self._submit_sp.stdout)) # type:
ignore
File
"/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/apache/spark/hooks/spark_submit.py",
line 456, in _process_spark_submit_log
for line in itr:
File
"/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py",
line 1410, in signal_handler
raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal
```
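For context, the failure path in the traceback above (a SIGTERM handler raising an exception while the hook is blocked in `for line in itr:`) can be reproduced outside Airflow. The following is only a minimal sketch of that mechanism, not Airflow's code: `TaskTerminated`, `fake_spark_output`, and `run_task` are hypothetical names introduced for illustration, and the process sends SIGTERM to itself to stand in for the pod being terminated. It assumes a POSIX system and that it runs in the main thread.

```python
import os
import signal


class TaskTerminated(Exception):
    """Hypothetical stand-in for AirflowException in this sketch."""


def signal_handler(signum, frame):
    # Raising from the handler interrupts whatever the main thread is doing,
    # which is why the traceback ends inside the log-reading loop.
    raise TaskTerminated("Task received SIGTERM signal")


def fake_spark_output():
    """Yield fake log lines and simulate an external SIGTERM midway."""
    for i in range(5):
        if i == 2:
            # Assumption: stands in for Kubernetes terminating the pod.
            os.kill(os.getpid(), signal.SIGTERM)
        yield f"spark log line {i}"


def run_task(lines):
    """Consume lines; return 'success' or 'failed' like a task state."""
    previous = signal.signal(signal.SIGTERM, signal_handler)
    try:
        for _line in lines:  # mirrors `for line in itr:` in the traceback
            pass
        return "success"
    except TaskTerminated:
        return "failed"
    finally:
        # Restore the earlier handler so the sketch has no side effects.
        if previous is not None:
            signal.signal(signal.SIGTERM, previous)


if __name__ == "__main__":
    print(run_task(fake_spark_output()))
```

In this sketch the task is marked as failed purely because the process received SIGTERM, regardless of how far the underlying work had progressed, which matches what the logs above suggest.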
Below is the log within the Airflow worker pod:
```
Running <TaskInstance:
b8455e69-ad99-4721-8a40-f0a7fe877389_623db928e9c8b434fa742404_24c566dc-f77a-4606-b38a-3f33f9199819
[queued]> on host 1a1606ebb2314870b3b2bea7daf32547
```
Below is the log within the scheduler pod at that time:
```
Fast evaluation: node ip-XX-XX-XX-XXX.ec2.internal cannot be removed:
airflow/1a1606ebb2314870b3b2bea7daf32547 is not replicated
Running <TaskInstance:
b8455e69-ad99-4721-8a40-f0a7fe877389_623db928e9c8b434fa742404_24c566dc-f77a-4606-b38a-3f33f9199819
[queued]> on host 1a1606ebb2314870b3b2bea7daf32547
```
### What you think should happen instead
The worker pod should not be terminated or restarted until the job
completes.
### How to reproduce
_No response_
### Operating System
Debian GNU/Linux
### Versions of Apache Airflow Providers
_No response_
### Deployment
Official Apache Airflow Helm Chart
### Deployment details
_No response_
### Anything else
_No response_
### Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)