MartinKChen commented on issue #19043:
URL: https://github.com/apache/airflow/issues/19043#issuecomment-950734099
After dig logs a bit deeper, I have found the patterns (and issues?). And
seems the issue should have nothing to do with MWAA environment.
1. The smart_sensor_task in all Shard DAGs create a new attempt when it pick
up SIs/TIs. While new attempt means retry as failed for other (non-SmartSensor)
tasks
2. It seems like smart_sensor_task is set to have maximum 101 attempts, does
this means the task will eventually failed and create a new DAG run after 101
times it execute some sensors to poke due to # 1.
```
[2021-10-25 07:20:29,820] {{taskinstance.py:1069}} INFO - Starting attempt
61 of 101
[2021-10-25 07:20:29,849] {{taskinstance.py:1070}} INFO -
```
3. The smart_sensor_task missed logs after it pick up SIs/TIs. To be more
specific, it missed logs after "_load_sensor_works" function executed, like
"Loaded %s sensor_works" and "Taking %s to execute %s tasks".
4. All these Shard DAG runs were **terminated 1 day after it started**. I am
wondering if it's because the timeout condition, but hard to tell since logs
after _load_sensor_works" are missed due to # 3.
```
if (timezone.utcnow() - started_at).total_seconds() > self.timeout:
self.log.info("Time is out for smart sensor.")
return
```
Here are 2 cases for # 4:
- Case 1, DAG run start at 2021-10-24T07:28:55 and terminated at 2021-10-25
07:34:01 on attempt 61
```
*** Reading remote log from Cloudwatch log_group: airflow-datalake-uat-Task
log_stream:
smart_sensor_group_shard_2/smart_sensor_task/2021-10-24T07_28_55.294439+00_00/61.log.
[2021-10-25 07:20:29,498] {{taskinstance.py:877}} INFO - Dependencies all
met for <TaskInstance: smart_sensor_group_shard_2.smart_sensor_task
2021-10-24T07:28:55.294439+00:00 [queued]>
[2021-10-25 07:20:29,763] {{taskinstance.py:877}} INFO - Dependencies all
met for <TaskInstance: smart_sensor_group_shard_2.smart_sensor_task
2021-10-24T07:28:55.294439+00:00 [queued]>
[2021-10-25 07:20:29,791] {{taskinstance.py:1068}} INFO -
--------------------------------------------------------------------------------
[2021-10-25 07:20:29,820] {{taskinstance.py:1069}} INFO - Starting attempt
61 of 101
[2021-10-25 07:20:29,849] {{taskinstance.py:1070}} INFO -
--------------------------------------------------------------------------------
[2021-10-25 07:20:29,884] {{taskinstance.py:1089}} INFO - Executing
<Task(SmartSensorOperator): smart_sensor_task> on
2021-10-24T07:28:55.294439+00:00
[2021-10-25 07:20:29,919] {{standard_task_runner.py:52}} INFO - Started
process 5123 to run task
[2021-10-25 07:20:29,919] {{standard_task_runner.py:52}} INFO - Started
process 5123 to run task
[2021-10-25 07:20:30,182] {{logging_mixin.py:104}} WARNING - KeyError:
'nextSequenceToken'
[2021-10-25 07:20:30,208] {{standard_task_runner.py:77}} INFO - Job 3854695:
Subtask smart_sensor_task
[2021-10-25 07:20:30,443] {{logging_mixin.py:104}} INFO - Running
<TaskInstance: smart_sensor_group_shard_2.smart_sensor_task
2021-10-24T07:28:55.294439+00:00 [running]> on host
ip-10-2-4-96.eu-central-1.compute.internal
[2021-10-25 07:20:30,679] {{taskinstance.py:1283}} INFO - Exporting the
following env vars:
AIRFLOW_CTX_DAG_OWNER=airflow
AIRFLOW_CTX_DAG_ID=smart_sensor_group_shard_2
AIRFLOW_CTX_TASK_ID=smart_sensor_task
AIRFLOW_CTX_EXECUTION_DATE=2021-10-24T07:28:55.294439+00:00
AIRFLOW_CTX_DAG_RUN_ID=scheduled__2021-10-24T07:28:55.294439+00:00
[2021-10-25 07:20:30,710] {{smart_sensor.py:358}} INFO - Performance query 0
tis, time: 0.009
[2021-10-25 07:20:30,731] {{smart_sensor.py:373}} INFO - 0 tasks detected.
[2021-10-25 07:20:30,759] {{smart_sensor.py:739}} INFO - Loaded 0
sensor_works
[2021-10-25 07:20:30,789] {{smart_sensor.py:747}} INFO - Taking 0.088914 to
execute 0 tasks.
[2021-10-25 07:23:30,826] {{smart_sensor.py:358}} INFO - Performance query 0
tis, time: 0.010
[2021-10-25 07:23:30,884] {{smart_sensor.py:373}} INFO - 0 tasks detected.
[2021-10-25 07:23:30,913] {{smart_sensor.py:739}} INFO - Loaded 0
sensor_works
[2021-10-25 07:23:30,939] {{smart_sensor.py:747}} INFO - Taking 0.122124 to
execute 0 tasks.
[2021-10-25 07:26:30,942] {{smart_sensor.py:358}} INFO - Performance query 4
tis, time: 0.006
[2021-10-25 07:34:01,165] {{process_utils.py:100}} INFO - Sending
Signals.SIGTERM to GPID 5123
[2021-10-25 07:34:01,417] {{process_utils.py:66}} INFO - Process
psutil.Process(pid=5123, status='terminated', exitcode=1, started='07:20:29')
(5123) terminated with exit code 1
```
- Case 2, DAG run start at 2021-10-16T07:30:04 and terminated at 2021-10-17
07:35:13 on attempt 17
```
*** Reading remote log from Cloudwatch log_group: airflow-datalake-uat-Task
log_stream:
smart_sensor_group_shard_0/smart_sensor_task/2021-10-16T07_30_04.386988+00_00/17.log.
[2021-10-16 21:46:26,012] {{taskinstance.py:877}} INFO - Dependencies all
met for <TaskInstance: smart_sensor_group_shard_0.smart_sensor_task
2021-10-16T07:30:04.386988+00:00 [queued]>
[2021-10-16 21:46:26,124] {{taskinstance.py:877}} INFO - Dependencies all
met for <TaskInstance: smart_sensor_group_shard_0.smart_sensor_task
2021-10-16T07:30:04.386988+00:00 [queued]>
[2021-10-16 21:46:26,152] {{taskinstance.py:1068}} INFO -
--------------------------------------------------------------------------------
[2021-10-16 21:46:26,176] {{taskinstance.py:1069}} INFO - Starting attempt
17 of 101
[2021-10-16 21:46:26,203] {{taskinstance.py:1070}} INFO -
--------------------------------------------------------------------------------
[2021-10-16 21:46:26,245] {{taskinstance.py:1089}} INFO - Executing
<Task(SmartSensorOperator): smart_sensor_task> on
2021-10-16T07:30:04.386988+00:00
[2021-10-16 21:46:26,275] {{standard_task_runner.py:52}} INFO - Started
process 16517 to run task
[2021-10-16 21:46:26,275] {{standard_task_runner.py:52}} INFO - Started
process 16517 to run task
[2021-10-16 21:46:26,520] {{standard_task_runner.py:77}} INFO - Job 3639181:
Subtask smart_sensor_task
[2021-10-16 21:46:26,793] {{logging_mixin.py:104}} INFO - Running
<TaskInstance: smart_sensor_group_shard_0.smart_sensor_task
2021-10-16T07:30:04.386988+00:00 [running]> on host
ip-10-2-5-156.eu-central-1.compute.internal
[2021-10-16 21:46:26,904] {{taskinstance.py:1283}} INFO - Exporting the
following env vars:
AIRFLOW_CTX_DAG_OWNER=airflow
AIRFLOW_CTX_DAG_ID=smart_sensor_group_shard_0
AIRFLOW_CTX_TASK_ID=smart_sensor_task
AIRFLOW_CTX_EXECUTION_DATE=2021-10-16T07:30:04.386988+00:00
AIRFLOW_CTX_DAG_RUN_ID=scheduled__2021-10-16T07:30:04.386988+00:00
[2021-10-16 21:46:26,940] {{smart_sensor.py:358}} INFO - Performance query 3
tis, time: 0.009
[2021-10-17 07:35:13,339] {{process_utils.py:100}} INFO - Sending
Signals.SIGTERM to GPID 16517
[2021-10-17 07:35:13,567] {{process_utils.py:66}} INFO - Process
psutil.Process(pid=16517, status='terminated', exitcode=1, started='21:46:25')
(16517) terminated with exit code 1
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]