MartinKChen commented on issue #19043:
URL: https://github.com/apache/airflow/issues/19043#issuecomment-950734099


   After dig logs a bit deeper, I have found the patterns (and issues?). And 
seems the issue should have nothing to do with MWAA environment.
   
   1. The smart_sensor_task in all Shard DAGs create a new attempt when it pick 
up SIs/TIs. While new attempt means retry as failed for other (non-SmartSensor) 
tasks
   2. It seems like smart_sensor_task is set to have maximum 101 attempts, does 
this means the task will eventually failed and create a new DAG run after 101 
times it execute some sensors to poke due to # 1.
   ```
   [2021-10-25 07:20:29,820] {{taskinstance.py:1069}} INFO - Starting attempt 
61 of 101
   [2021-10-25 07:20:29,849] {{taskinstance.py:1070}} INFO - 
   ```
   3. The smart_sensor_task missed logs after it pick up SIs/TIs. To be more 
specific, it missed logs after "_load_sensor_works" function executed, like 
"Loaded %s sensor_works" and "Taking %s to execute %s tasks".
   4. All these Shard DAG runs were **terminated 1 day after it started**. I am 
wondering if it's because the timeout condition, but hard to tell since logs 
after _load_sensor_works" are missed due to # 3.
   ```
   if (timezone.utcnow() - started_at).total_seconds() > self.timeout:
   self.log.info("Time is out for smart sensor.")
   return
   ```
   Here are 2 cases for # 4:
   - Case 1, DAG run start at 2021-10-24T07:28:55 and terminated at 2021-10-25 
07:34:01 on attempt 61
   ```
   *** Reading remote log from Cloudwatch log_group: airflow-datalake-uat-Task 
log_stream: 
smart_sensor_group_shard_2/smart_sensor_task/2021-10-24T07_28_55.294439+00_00/61.log.
   [2021-10-25 07:20:29,498] {{taskinstance.py:877}} INFO - Dependencies all 
met for <TaskInstance: smart_sensor_group_shard_2.smart_sensor_task 
2021-10-24T07:28:55.294439+00:00 [queued]>
   [2021-10-25 07:20:29,763] {{taskinstance.py:877}} INFO - Dependencies all 
met for <TaskInstance: smart_sensor_group_shard_2.smart_sensor_task 
2021-10-24T07:28:55.294439+00:00 [queued]>
   [2021-10-25 07:20:29,791] {{taskinstance.py:1068}} INFO - 
   
--------------------------------------------------------------------------------
   [2021-10-25 07:20:29,820] {{taskinstance.py:1069}} INFO - Starting attempt 
61 of 101
   [2021-10-25 07:20:29,849] {{taskinstance.py:1070}} INFO - 
   
--------------------------------------------------------------------------------
   [2021-10-25 07:20:29,884] {{taskinstance.py:1089}} INFO - Executing 
<Task(SmartSensorOperator): smart_sensor_task> on 
2021-10-24T07:28:55.294439+00:00
   [2021-10-25 07:20:29,919] {{standard_task_runner.py:52}} INFO - Started 
process 5123 to run task
   [2021-10-25 07:20:29,919] {{standard_task_runner.py:52}} INFO - Started 
process 5123 to run task
   [2021-10-25 07:20:30,182] {{logging_mixin.py:104}} WARNING - KeyError: 
'nextSequenceToken'
   [2021-10-25 07:20:30,208] {{standard_task_runner.py:77}} INFO - Job 3854695: 
Subtask smart_sensor_task
   [2021-10-25 07:20:30,443] {{logging_mixin.py:104}} INFO - Running 
<TaskInstance: smart_sensor_group_shard_2.smart_sensor_task 
2021-10-24T07:28:55.294439+00:00 [running]> on host 
ip-10-2-4-96.eu-central-1.compute.internal
   [2021-10-25 07:20:30,679] {{taskinstance.py:1283}} INFO - Exporting the 
following env vars:
   AIRFLOW_CTX_DAG_OWNER=airflow
   AIRFLOW_CTX_DAG_ID=smart_sensor_group_shard_2
   AIRFLOW_CTX_TASK_ID=smart_sensor_task
   AIRFLOW_CTX_EXECUTION_DATE=2021-10-24T07:28:55.294439+00:00
   AIRFLOW_CTX_DAG_RUN_ID=scheduled__2021-10-24T07:28:55.294439+00:00
   [2021-10-25 07:20:30,710] {{smart_sensor.py:358}} INFO - Performance query 0 
tis, time: 0.009
   [2021-10-25 07:20:30,731] {{smart_sensor.py:373}} INFO - 0 tasks detected.
   [2021-10-25 07:20:30,759] {{smart_sensor.py:739}} INFO - Loaded 0 
sensor_works
   [2021-10-25 07:20:30,789] {{smart_sensor.py:747}} INFO - Taking 0.088914 to 
execute 0 tasks.
   [2021-10-25 07:23:30,826] {{smart_sensor.py:358}} INFO - Performance query 0 
tis, time: 0.010
   [2021-10-25 07:23:30,884] {{smart_sensor.py:373}} INFO - 0 tasks detected.
   [2021-10-25 07:23:30,913] {{smart_sensor.py:739}} INFO - Loaded 0 
sensor_works
   [2021-10-25 07:23:30,939] {{smart_sensor.py:747}} INFO - Taking 0.122124 to 
execute 0 tasks.
   [2021-10-25 07:26:30,942] {{smart_sensor.py:358}} INFO - Performance query 4 
tis, time: 0.006
   [2021-10-25 07:34:01,165] {{process_utils.py:100}} INFO - Sending 
Signals.SIGTERM to GPID 5123
   [2021-10-25 07:34:01,417] {{process_utils.py:66}} INFO - Process 
psutil.Process(pid=5123, status='terminated', exitcode=1, started='07:20:29') 
(5123) terminated with exit code 1
   ```
   - Case 2, DAG run start at 2021-10-16T07:30:04 and terminated at 2021-10-17 
07:35:13 on attempt 17
   ```
   *** Reading remote log from Cloudwatch log_group: airflow-datalake-uat-Task 
log_stream: 
smart_sensor_group_shard_0/smart_sensor_task/2021-10-16T07_30_04.386988+00_00/17.log.
   [2021-10-16 21:46:26,012] {{taskinstance.py:877}} INFO - Dependencies all 
met for <TaskInstance: smart_sensor_group_shard_0.smart_sensor_task 
2021-10-16T07:30:04.386988+00:00 [queued]>
   [2021-10-16 21:46:26,124] {{taskinstance.py:877}} INFO - Dependencies all 
met for <TaskInstance: smart_sensor_group_shard_0.smart_sensor_task 
2021-10-16T07:30:04.386988+00:00 [queued]>
   [2021-10-16 21:46:26,152] {{taskinstance.py:1068}} INFO - 
   
--------------------------------------------------------------------------------
   [2021-10-16 21:46:26,176] {{taskinstance.py:1069}} INFO - Starting attempt 
17 of 101
   [2021-10-16 21:46:26,203] {{taskinstance.py:1070}} INFO - 
   
--------------------------------------------------------------------------------
   [2021-10-16 21:46:26,245] {{taskinstance.py:1089}} INFO - Executing 
<Task(SmartSensorOperator): smart_sensor_task> on 
2021-10-16T07:30:04.386988+00:00
   [2021-10-16 21:46:26,275] {{standard_task_runner.py:52}} INFO - Started 
process 16517 to run task
   [2021-10-16 21:46:26,275] {{standard_task_runner.py:52}} INFO - Started 
process 16517 to run task
   [2021-10-16 21:46:26,520] {{standard_task_runner.py:77}} INFO - Job 3639181: 
Subtask smart_sensor_task
   [2021-10-16 21:46:26,793] {{logging_mixin.py:104}} INFO - Running 
<TaskInstance: smart_sensor_group_shard_0.smart_sensor_task 
2021-10-16T07:30:04.386988+00:00 [running]> on host 
ip-10-2-5-156.eu-central-1.compute.internal
   [2021-10-16 21:46:26,904] {{taskinstance.py:1283}} INFO - Exporting the 
following env vars:
   AIRFLOW_CTX_DAG_OWNER=airflow
   AIRFLOW_CTX_DAG_ID=smart_sensor_group_shard_0
   AIRFLOW_CTX_TASK_ID=smart_sensor_task
   AIRFLOW_CTX_EXECUTION_DATE=2021-10-16T07:30:04.386988+00:00
   AIRFLOW_CTX_DAG_RUN_ID=scheduled__2021-10-16T07:30:04.386988+00:00
   [2021-10-16 21:46:26,940] {{smart_sensor.py:358}} INFO - Performance query 3 
tis, time: 0.009
   [2021-10-17 07:35:13,339] {{process_utils.py:100}} INFO - Sending 
Signals.SIGTERM to GPID 16517
   [2021-10-17 07:35:13,567] {{process_utils.py:66}} INFO - Process 
psutil.Process(pid=16517, status='terminated', exitcode=1, started='21:46:25') 
(16517) terminated with exit code 1
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to