MartinKChen opened a new issue #19043:
URL: https://github.com/apache/airflow/issues/19043
### Apache Airflow version
2.0.2
### Operating System
Linux (AWS MWAA)
### Versions of Apache Airflow Providers
MWAA with Apache Airflow 2.0.2
### Deployment
MWAA
### Deployment details
_No response_
### What happened
SmartSensor shard tasks keep getting terminated at random, with no logs explaining why.
**Case 1: Shard terminated after tasks were detected, without any log.**
[2021-10-18 07:10:15,580] {{smart_sensor.py:358}} INFO - Performance query 0 tis, time: 0.008
[2021-10-18 07:10:15,605] {{smart_sensor.py:373}} INFO - 0 tasks detected.
[2021-10-18 07:10:15,633] {{smart_sensor.py:739}} INFO - Loaded 0 sensor_works
[2021-10-18 07:10:15,666] {{smart_sensor.py:747}} INFO - Taking 0.09376 to execute 0 tasks.
[2021-10-18 07:13:15,671] {{smart_sensor.py:358}} INFO - Performance query 0 tis, time: 0.014
[2021-10-18 07:13:15,736] {{smart_sensor.py:373}} INFO - 0 tasks detected.
[2021-10-18 07:13:15,768] {{smart_sensor.py:739}} INFO - Loaded 0 sensor_works
[2021-10-18 07:13:15,844] {{smart_sensor.py:747}} INFO - Taking 0.186399 to execute 0 tasks.
[2021-10-18 07:16:15,794] {{smart_sensor.py:358}} INFO - Performance query 4 tis, time: 0.008
[2021-10-18 07:35:25,736] {{process_utils.py:100}} INFO - Sending Signals.SIGTERM to GPID 13284
[2021-10-18 07:35:25,996] {{process_utils.py:66}} INFO - Process psutil.Process(pid=13284, status='terminated', exitcode=1, started='07:10:13') (13284) terminated with exit code 1
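To put numbers on how long the shard stayed silent before it was killed, the timestamps in the Case 1 log can be compared directly. The sketch below parses the `[YYYY-MM-DD HH:MM:SS,mmm]` prefix used in these task logs (the format is read off the excerpt; `parse_ts` and `gaps_seconds` are illustrative helper names, not Airflow APIs) and prints the gap between consecutive lines.

```python
import re
from datetime import datetime
from typing import List

# Matches the "[2021-10-18 07:10:15,580]" prefix seen in the task log above.
TS_PATTERN = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\]")


def parse_ts(line: str) -> datetime:
    """Return the timestamp at the start of a log line."""
    match = TS_PATTERN.match(line)
    if match is None:
        raise ValueError(f"no timestamp prefix in: {line!r}")
    # %f accepts the 3-digit millisecond field and pads it to microseconds.
    return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")


def gaps_seconds(lines: List[str]) -> List[float]:
    """Seconds elapsed between consecutive timestamped lines."""
    stamps = [parse_ts(line) for line in lines]
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


# Lines copied from Case 1: the three "Performance query" entries, then the SIGTERM.
case1 = [
    "[2021-10-18 07:10:15,580] {{smart_sensor.py:358}} INFO - Performance query 0 tis, time: 0.008",
    "[2021-10-18 07:13:15,671] {{smart_sensor.py:358}} INFO - Performance query 0 tis, time: 0.014",
    "[2021-10-18 07:16:15,794] {{smart_sensor.py:358}} INFO - Performance query 4 tis, time: 0.008",
    "[2021-10-18 07:35:25,736] {{process_utils.py:100}} INFO - Sending Signals.SIGTERM to GPID 13284",
]

for gap in gaps_seconds(case1):
    print(f"{gap:.3f} s")
```

On this excerpt the gaps are 180.091 s, 180.123 s and 1149.942 s: the shard logged a `Performance query` line roughly every three minutes, then wrote nothing for about 19 minutes after finding 4 task instances, until the SIGTERM arrived.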
**Case 2: Shard terminated after 1 day with 0 tasks sensed during the run (which lasted around 24 hours), and the log does not make clear what caused the termination.**
[2021-10-13 07:27:28,922] {{smart_sensor.py:358}} INFO - Performance query 0 tis, time: 0.013
[2021-10-13 07:27:28,989] {{smart_sensor.py:373}} INFO - 0 tasks detected.
[2021-10-13 07:27:29,021] {{smart_sensor.py:739}} INFO - Loaded 0 sensor_works
[2021-10-13 07:27:29,044] {{smart_sensor.py:747}} INFO - Taking 0.135416 to execute 0 tasks.
[2021-10-13 07:30:29,065] {{smart_sensor.py:358}} INFO - Performance query 0 tis, time: 0.012
[2021-10-13 07:30:29,114] {{smart_sensor.py:373}} INFO - 0 tasks detected.
[2021-10-13 07:30:29,141] {{smart_sensor.py:739}} INFO - Loaded 0 sensor_works
[2021-10-13 07:30:29,163] {{smart_sensor.py:747}} INFO - Taking 0.11131 to execute 0 tasks.
[2021-10-13 07:33:29,155] {{smart_sensor.py:358}} INFO - Performance query 0 tis, time: 0.006
[2021-10-13 07:33:29,201] {{smart_sensor.py:373}} INFO - 0 tasks detected.
.......
[2021-10-14 07:25:25,821] {{smart_sensor.py:358}} INFO - Performance query 0 tis, time: 0.006
[2021-10-14 07:25:25,870] {{smart_sensor.py:373}} INFO - 0 tasks detected.
[2021-10-14 07:25:25,896] {{smart_sensor.py:739}} INFO - Loaded 0 sensor_works
[2021-10-14 07:25:25,922] {{smart_sensor.py:747}} INFO - Taking 0.10663 to execute 0 tasks.
[2021-10-14 07:27:30,839] {{process_utils.py:100}} INFO - Sending Signals.SIGTERM to GPID 25285
[2021-10-14 07:27:30,933] {{taskinstance.py:1265}} ERROR - Received SIGTERM. Terminating subprocesses.
[2021-10-14 07:27:31,028] {{taskinstance.py:1482}} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1138, in _run_raw_task
    self._prepare_and_execute_task_with_callbacks(context, task)
  File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1311, in _prepare_and_execute_task_with_callbacks
    result = self._execute_task(context, task_copy)
  File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1341, in _execute_task
    result = task_copy.execute(context=context)
  File "/usr/local/lib/python3.7/site-packages/airflow/sensors/smart_sensor.py", line 754, in execute
    sleep(self.poke_interval - duration)
  File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1267, in signal_handler
    raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal
[2021-10-14 07:27:31,060] {{taskinstance.py:1532}} INFO - Marking task as UP_FOR_RETRY. dag_id=smart_sensor_group_shard_1, task_id=smart_sensor_task, execution_date=20211013T072226, start_date=20211013T072728, end_date=20211014T072731
[2021-10-14 07:27:31,146] {{process_utils.py:66}} INFO - Process psutil.Process(pid=25285, status='terminated', exitcode=1, started='2021-10-13 07:27:28') (25285) terminated with exit code 1
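The `Marking task as UP_FOR_RETRY` line reports the shard's `start_date` and `end_date` in compact form. A small sketch (the format string is inferred from the values shown; `run_duration_seconds` is a hypothetical helper) that turns them into a run duration:

```python
from datetime import datetime

# Compact timestamps as printed in the UP_FOR_RETRY line, e.g. "20211013T072728".
COMPACT_FORMAT = "%Y%m%dT%H%M%S"


def run_duration_seconds(start_date: str, end_date: str) -> float:
    """Elapsed seconds between two compact timestamps from the task log."""
    start = datetime.strptime(start_date, COMPACT_FORMAT)
    end = datetime.strptime(end_date, COMPACT_FORMAT)
    return (end - start).total_seconds()


# Values copied from the Case 2 UP_FOR_RETRY line.
duration = run_duration_seconds("20211013T072728", "20211014T072731")
hours, remainder = divmod(int(duration), 3600)
minutes, seconds = divmod(remainder, 60)
print(f"{duration:.0f} s = {hours}h {minutes}m {seconds}s")
```

This prints `86403 s = 24h 0m 3s`, which matches the "after 1 day" description in the Case 2 heading: the SIGTERM arrived 3 seconds past the 24-hour mark of the shard run.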
### What you expected to happen
1. SmartSensor shards should not be terminated unless an error has occurred.
2. If a shard is terminated because of an error, the shard's task log should contain a corresponding log entry.
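Item 2 can be illustrated with a minimal, hypothetical sketch (not Airflow's actual `signal_handler`) of a SIGTERM handler that always writes a log record, including how long the process had been running, and flushes it before raising, so every termination leaves a trace in the task log. `TaskTerminated`, `make_sigterm_handler` and `install_sigterm_logger` are illustrative names.

```python
import logging
import signal
import time

log = logging.getLogger("shard")


class TaskTerminated(Exception):
    """Raised by the handler so the task fails with an explicit reason."""


def make_sigterm_handler(started_at: float):
    """Build a handler that logs and flushes before raising."""

    def handler(signum, frame):
        name = signal.Signals(signum).name
        uptime = time.monotonic() - started_at
        log.error("Received %s after %.1f s of runtime; terminating.", name, uptime)
        # Flush handlers so the record is not lost if the process dies right after.
        for h in logging.getLogger().handlers + log.handlers:
            h.flush()
        raise TaskTerminated(f"received {name}")

    return handler


def install_sigterm_logger() -> None:
    """Install the handler; signal.signal must be called from the main thread."""
    signal.signal(signal.SIGTERM, make_sigterm_handler(time.monotonic()))
```

Calling `install_sigterm_logger()` once at process start is enough; the handler is built by a factory so the start time is captured when it is installed rather than when the signal arrives.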
### How to reproduce
1. Enable SmartSensor
2. Schedule DAGs with SmartSensor enabled operators.
3. Let the DAG run for a couple of hours or days.
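Step 1 can be scripted. Airflow reads configuration overrides from environment variables named `AIRFLOW__{SECTION}__{KEY}`; the sketch below builds such overrides for the `[smart_sensor]` section. The keys are the ones we understand Airflow 2.0 to expose (`use_smart_sensor`, `shards`, `shard_code_upper_limit`, `sensors_enabled`), and the values are placeholders, so check both against your version's default configuration. MWAA manages configuration through its own Airflow configuration options, so the mechanism there may differ.

```python
from typing import Dict, Mapping


def airflow_env_overrides(section: str, options: Mapping[str, str]) -> Dict[str, str]:
    """Map config options to Airflow's AIRFLOW__{SECTION}__{KEY} variable names."""
    return {
        f"AIRFLOW__{section.upper()}__{key.upper()}": str(value)
        for key, value in options.items()
    }


# Placeholder values (assumptions): adjust shards and sensors_enabled to your setup.
smart_sensor_options = {
    "use_smart_sensor": "True",
    "shards": "5",
    "shard_code_upper_limit": "10000",
    "sensors_enabled": "NamedHivePartitionSensor",
}

for name, value in airflow_env_overrides("smart_sensor", smart_sensor_options).items():
    print(f"export {name}={value}")
```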
### Anything else
The error/termination occurs at intervals ranging from a few seconds to a few hours, with no discernible pattern. This makes monitoring difficult, because most of the resulting alerts are false alarms caused by the self-termination.
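As a stopgap for the false alarms, alerts could be filtered by inspecting the shard's task log. The sketch below is a heuristic built only from the log lines quoted in this report (the marker strings come from Case 1 and Case 2; `classify_shard_failure` and its return labels are made up for illustration).

```python
from typing import Iterable

# Marker strings copied from the shard logs in this report.
SIGTERM_MARKERS = (
    "Sending Signals.SIGTERM to GPID",
    "Task received SIGTERM signal",
)


def classify_shard_failure(log_lines: Iterable[str]) -> str:
    """Heuristically label a failed shard run from its task log.

    Returns "sigterm" if any SIGTERM marker appears, "error" if the log has
    a traceback but no SIGTERM marker, and "unknown" otherwise.
    """
    saw_traceback = False
    for line in log_lines:
        if any(marker in line for marker in SIGTERM_MARKERS):
            return "sigterm"
        if line.startswith("Traceback (most recent call last):"):
            saw_traceback = True
    return "error" if saw_traceback else "unknown"
```

Checking for SIGTERM markers before tracebacks matters because, as in Case 2, a SIGTERM also produces an `AirflowException` traceback; routing `sigterm` results to a lower-severity channel would keep genuine errors visible.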
### Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)