bipin2295 opened a new issue #22528:
URL: https://github.com/apache/airflow/issues/22528


   ### Apache Airflow version
   
   2.2.3
   
   ### What happened
   
   We are running Airflow with the KubernetesExecutor. While a task was running, 
the Airflow worker pod appears to have been restarted or terminated, which caused 
the running task to be marked as failed with a SIGTERM error.
   
   Below is the task log from Airflow:
   ```
   [2022-03-25, 19:09:45 IST] {local_task_job.py:82} ERROR - Received SIGTERM. Terminating subprocesses
   [2022-03-25, 19:09:45 IST] {process_utils.py:120} INFO - Sending Signals.SIGTERM to group 121. PIDs of all processes in the group: [122, 121]
   [2022-03-25, 19:09:45 IST] {process_utils.py:75} INFO - Sending the signal Signals.SIGTERM to group 121
   [2022-03-25, 19:09:45 IST] {taskinstance.py:1408} ERROR - Received SIGTERM. Terminating subprocesses.
   [2022-03-25, 19:09:45 IST] {spark_submit.py:623} INFO - Sending kill signal to spark-submit
   [2022-03-25, 19:09:45 IST] {taskinstance.py:1700} ERROR - Task failed with exception
   Traceback (most recent call last):
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1329, in _run_raw_task
       self._execute_task_with_callbacks(context)
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1455, in _execute_task_with_callbacks
       result = self._execute_task(context, self.task)
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1511, in _execute_task
       result = execute_callable(context=context)
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/apache/spark/operators/spark_submit.py", line 157, in execute
       self._hook.submit(self._application)
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/apache/spark/hooks/spark_submit.py", line 407, in submit
       self._process_spark_submit_log(iter(self._submit_sp.stdout))  # type: ignore
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/apache/spark/hooks/spark_submit.py", line 456, in _process_spark_submit_log
       for line in itr:
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1410, in signal_handler
       raise AirflowException("Task received SIGTERM signal")
   airflow.exceptions.AirflowException: Task received SIGTERM signal
   [2022-03-25, 19:09:45 IST] {taskinstance.py:1267} INFO - Marking task as FAILED. dag_id=kda_create_model_alpha, task_id=create_model, execution_date=20220325T124433, start_date=20220325T124451, end_date=20220325T133945
   [2022-03-25, 19:09:46 IST] {standard_task_runner.py:89} ERROR - Failed to execute job 2451 for task create_model
   Traceback (most recent call last):
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/task/task_runner/standard_task_runner.py", line 85, in _start_by_fork
       args.func(args, dag=self.dag)
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/cli_parser.py", line 48, in command
       return func(*args, **kwargs)
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/cli.py", line 92, in wrapper
       return f(*args, **kwargs)
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 298, in task_run
       _run_task_by_selected_method(args, dag, ti)
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 107, in _run_task_by_selected_method
       _run_raw_task(args, ti)
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 180, in _run_raw_task
       ti._run_raw_task(
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/session.py", line 70, in wrapper
       return func(*args, session=session, **kwargs)
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1329, in _run_raw_task
       self._execute_task_with_callbacks(context)
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1455, in _execute_task_with_callbacks
       result = self._execute_task(context, self.task)
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1511, in _execute_task
       result = execute_callable(context=context)
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/apache/spark/operators/spark_submit.py", line 157, in execute
       self._hook.submit(self._application)
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/apache/spark/hooks/spark_submit.py", line 407, in submit
       self._process_spark_submit_log(iter(self._submit_sp.stdout))  # type: ignore
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/apache/spark/hooks/spark_submit.py", line 456, in _process_spark_submit_log
       for line in itr:
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1410, in signal_handler
       raise AirflowException("Task received SIGTERM signal")
   airflow.exceptions.AirflowException: Task received SIGTERM signal
   ```
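
For readers unfamiliar with this failure shape: the traceback ends in `signal_handler`, raised while the Spark hook was blocked in `for line in itr:` reading the `spark-submit` output. The following is a minimal, standard-library-only sketch of that mechanism, not Airflow's actual code. `TaskTerminated`, `read_until_sigterm`, and the child command are hypothetical stand-ins, the SIGTERM is simulated from a timer thread rather than sent by Kubernetes, and a POSIX system is assumed.

```python
import signal
import subprocess
import sys
import threading


class TaskTerminated(Exception):
    """Hypothetical stand-in for airflow.exceptions.AirflowException."""


def signal_handler(signum, frame):
    # Raising from the handler surfaces the exception at whatever line
    # the main thread is executing when the signal arrives.
    raise TaskTerminated("Task received SIGTERM signal")


def read_until_sigterm():
    """Read a child's stdout until SIGTERM interrupts the blocking loop."""
    previous = signal.signal(signal.SIGTERM, signal_handler)
    # Hypothetical stand-in for spark-submit: print one line, then hang.
    child = subprocess.Popen(
        [sys.executable, "-u", "-c",
         "import time; print('started'); time.sleep(30)"],
        stdout=subprocess.PIPE,
        text=True,
    )
    # Simulate the external SIGTERM by signalling the main thread
    # 0.2 s after the first line of output has been read.
    timer = threading.Timer(
        0.2,
        signal.pthread_kill,
        args=(threading.main_thread().ident, signal.SIGTERM),
    )
    lines, error = [], None
    try:
        for line in child.stdout:  # blocks here, like `for line in itr:`
            lines.append(line.strip())
            if len(lines) == 1:
                timer.start()
    except TaskTerminated as exc:
        error = str(exc)
    finally:
        timer.cancel()
        child.kill()  # clean up the child, as the log shows for spark-submit
        child.wait()
        child.stdout.close()
        signal.signal(signal.SIGTERM, previous)
    return lines, error
```

Calling `read_until_sigterm()` from the main thread returns the lines read before the signal together with the handler's message. The point of the sketch is that the exception originates from the signal, not from the read loop itself, so the task code is not at fault in a trace like the one above.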
   
   
   Below is the log from the Airflow worker pod:
   
   ```
   Running <TaskInstance: b8455e69-ad99-4721-8a40-f0a7fe877389_623db928e9c8b434fa742404_24c566dc-f77a-4606-b38a-3f33f9199819 [queued]> on host 1a1606ebb2314870b3b2bea7daf32547
   ```
   
   
   Below is the log from the scheduler pod at that time:
   
   ```
   Fast evaluation: node ip-XX-XX-XX-XXX.ec2.internal cannot be removed: airflow/1a1606ebb2314870b3b2bea7daf32547 is not replicated
   
   
   Running <TaskInstance: b8455e69-ad99-4721-8a40-f0a7fe877389_623db928e9c8b434fa742404_24c566dc-f77a-4606-b38a-3f33f9199819 [queued]> on host 1a1606ebb2314870b3b2bea7daf32547
   
   ```
   
   ### What you think should happen instead
   
   The worker pod should not be terminated or restarted before the task 
completes.
   
   ### How to reproduce
   
   _No response_
   
   ### Operating System
   
   Debian GNU/Linux
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   _No response_
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

