SchernHe commented on issue #24466:
URL: https://github.com/apache/airflow/issues/24466#issuecomment-1303197912

   Is there any news on this issue? I am also seeing similar behavior for a 
long-running task.
   
   My DAG includes an SSHOperator that executes a long-running process on a 
Google Cloud Compute Engine instance. The task keeps running, even though the 
underlying process gets killed for some reason along the way. The weird thing 
is that I do not see any log message stating why the process was killed; the 
task simply stops logging.
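   
   For reference, the failing task is roughly equivalent to the sketch below. 
This is only an illustration of my setup, not the actual DAG: the connection id 
and command are placeholders, and `keepalive_interval` / `cmd_timeout` / 
`get_pty` are simply the arguments I would expect to matter for a long-running 
remote command (`cmd_timeout` may only exist in more recent versions of the SSH 
provider).
   ```python
   from datetime import datetime

   from airflow import DAG
   from airflow.providers.ssh.hooks.ssh import SSHHook
   from airflow.providers.ssh.operators.ssh import SSHOperator

   with DAG(
       dag_id="complete_training_pipeline",
       start_date=datetime(2022, 11, 1),
       schedule=None,
       catchup=False,
   ) as dag:
       start_bbox_training = SSHOperator(
           task_id="start_bbox_training",
           ssh_hook=SSHHook(
               # "gce_ssh" is a placeholder for my real Compute Engine connection.
               ssh_conn_id="gce_ssh",
               # Send keepalive packets so an otherwise idle session is less
               # likely to be dropped silently during the long training run.
               keepalive_interval=30,
           ),
           # Placeholder for the actual long-running training command.
           command="python /opt/training/run_training.py",
           # No client-side timeout for the remote command (assuming a provider
           # version that supports cmd_timeout).
           cmd_timeout=None,
           get_pty=True,
       )
   ```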
   
   ## General Information
   - Version: apache-airflow==2.4.2
   - Executor: CeleryKubernetesExecutor + Redis
   - Docker Setup as described 
[here](https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html)
   - The Compute Engine instance keeps running and no reboot occurred
   - The Compute Engine instance shows no spikes or anything unusual in CPU/GPU usage
   
   
   ## Task Logs
   
   Here are the logs of the latest execution:
   ```
   *** Reading local file: 
/opt/airflow/logs/dag_id=XXXX/run_id=manual__2022-11-03T22:37:42.838999+00:00/task_id=XXXX/attempt=1.log
   [2022-11-04, 01:31:11 UTC] {taskinstance.py:1165} INFO - Dependencies all 
met for <TaskInstance: complete_training_pipeline.XXXX 
manual__2022-11-03T22:37:42.838999+00:00 [queued]>
   [2022-11-04, 01:31:11 UTC] {taskinstance.py:1165} INFO - Dependencies all 
met for <TaskInstance: complete_training_pipeline.XXXX 
manual__2022-11-03T22:37:42.838999+00:00 [queued]>
   [2022-11-04, 01:31:11 UTC] {taskinstance.py:1362} INFO - 
   
--------------------------------------------------------------------------------
   [2022-11-04, 01:31:11 UTC] {taskinstance.py:1363} INFO - Starting attempt 1 
of 1
   [2022-11-04, 01:31:11 UTC] {taskinstance.py:1364} INFO - 
   
--------------------------------------------------------------------------------
   [2022-11-04, 01:31:11 UTC] {taskinstance.py:1383} INFO - Executing 
<Task(SSHOperator): XXXX> on 2022-11-03 22:37:42.838999+00:00
   [2022-11-04, 01:31:11 UTC] {standard_task_runner.py:55} INFO - Started 
process 18209 to run task
   [2022-11-04, 01:31:11 UTC] {standard_task_runner.py:82} INFO - Running: 
['***', 'tasks', 'run', 'XXXX', 'XXXX', 
'manual__2022-11-03T22:37:42.838999+00:00', '--job-id', '214', '--raw', 
'--subdir', 'DAGS_FOLDER/XXXX.py', '--cfg-path', '/tmp/tmp1y9_3kwh']
   [2022-11-04, 01:31:11 UTC] {standard_task_runner.py:83} INFO - Job 214: 
Subtask XXXX
   [2022-11-04, 01:31:11 UTC] {task_command.py:376} INFO - Running 
<TaskInstance: XXXX.XXXX manual__2022-11-03T22:37:42.838999+00:00 [running]> on 
host 6eac1f0ec121
   [2022-11-04, 01:31:12 UTC] {taskinstance.py:1590} INFO - Exporting the 
following env vars:
   AIRFLOW_CTX_DAG_OWNER=***
   AIRFLOW_CTX_DAG_ID=XXXX
   AIRFLOW_CTX_TASK_ID=XXXX
   AIRFLOW_CTX_EXECUTION_DATE=2022-11-03T22:37:42.838999+00:00
   AIRFLOW_CTX_TRY_NUMBER=1
   AIRFLOW_CTX_DAG_RUN_ID=manual__2022-11-03T22:37:42.838999+00:00
   [2022-11-04, 01:31:12 UTC] {ssh.py:137} INFO - Creating ssh_client
   ...
   [2022-11-04, 01:31:19 UTC] {transport.py:1874} INFO - Connected (version 
2.0, client OpenSSH_7.9p1)
   [2022-11-04, 01:31:20 UTC] {transport.py:1874} INFO - Authentication 
(publickey) successful!
   [2022-11-04, 01:31:20 UTC] {ssh.py:469} INFO - Running command: .....
   ....
   [2022-11-04, 02:11:20 UTC] {ssh.py:501} INFO - "My last log message"
   [2022-11-04, 07:32:16 UTC] {taskinstance.py:1155} INFO - Dependencies not 
met for <TaskInstance: complete_training_pipeline.start_bbox_training 
manual__2022-11-03T22:37:42.838999+00:00 [running]>, dependency 'Task Instance 
Not Running' FAILED: Task is in the running state
   [2022-11-04, 07:32:16 UTC] {taskinstance.py:1155} INFO - Dependencies not 
met for <TaskInstance: complete_training_pipeline.start_bbox_training 
manual__2022-11-03T22:37:42.838999+00:00 [running]>, dependency 'Task Instance 
State' FAILED: Task is in the 'running' state.
   [2022-11-04, 07:32:16 UTC] {local_task_job.py:103} INFO - Task is not able 
to be run
   ```
   
   
   ## Scheduler Logs
   The CPU utilization of the instance shows that the task stopped somewhere 
around *02:40:00 - 02:45:00 UTC*, but the scheduler logs do not show anything 
around that time, only the periodic 
   `[2022-11-04 02:45:37,716] {scheduler_job.py:1381} INFO - Resetting orphaned 
tasks for active dag runs`
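   
   If I understand the scheduler correctly, that message just comes from the 
periodic orphaned-task check rather than from anything happening to my task. To 
confirm the cadence (I believe the interval defaults to 300 seconds, which 
matches the roughly five-minute spacing of these messages), something like this 
should print the configured value:
   ```python
   from airflow.configuration import conf

   # Interval in seconds between the scheduler's "Resetting orphaned tasks"
   # checks; I assume the relevant key is [scheduler] orphaned_tasks_check_interval.
   print(conf.getfloat("scheduler", "orphaned_tasks_check_interval"))
   ```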

