V0lantis commented on issue #10790:
URL: https://github.com/apache/airflow/issues/10790#issuecomment-1177296853

   Hello here :wave:,
   I would very much like to help, as we are experiencing the exact same issue with sensors in our Kubernetes cluster with the Celery executor. Just to add our own experience:
   
   We run an architecture with a shared NFS that hosts our DAGs and logs (among other things). The speed and throughput allowed by the shared filesystem are a huge bottleneck for the scheduler (since it needs to parse the DAGs quite often). We noticed that the issue with the sensors and the log message `INFO - Task is not able to run` appeared once we had used up the credits AWS was allowing us. We therefore switched to a higher-performance filesystem (FSx, for those who are wondering), and the issue has almost disappeared, though it is still there (which is why I would very much like to help).
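
   (As a side note, here is a quick sketch of how to check the scheduler settings that drive how hard the DAG-parsing loop hits the shared filesystem; the section/key names are the ones I know from Airflow 2.2, so adjust for your version. Raising `min_file_process_interval` is one knob to relieve a slow shared volume.)

   ```python
   # Sketch: print the scheduler options that control DAG-file parsing frequency,
   # i.e. how often the scheduler reads DAG files from the shared NFS/FSx volume.
   from airflow.configuration import conf

   for key in ("min_file_process_interval", "dag_dir_list_interval", "parsing_processes"):
       # conf.get() resolves airflow.cfg as well as AIRFLOW__SCHEDULER__* overrides
       print(f"[scheduler] {key} = {conf.get('scheduler', key)}")
   ```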
   
   But we are also experiencing a similar issue, though I am not exactly sure it is the same one.
   For example, sometimes a task will be picked up by a Celery worker but will not output any log file. Here is what I found:
   
   1. A task is in a running state for a while without anything happening.
   2. When we take a look at the log, we see the following:
   
   ```log
   *** Reading remote log from 
s3://my.bucket/airflow/prod/my_dag/one_of_dag_task/2022-07-03T12:00:00+00:00/1.log.
   [2022-07-04, 20:58:08 UTC] {taskinstance.py:1037} INFO - Dependencies not 
met for <TaskInstance: my_dag.one_of_dag_task 
scheduled__2022-07-03T12:00:00+00:00 [running]>, dependency 'Task Instance Not 
Running' FAILED: Task is in the running state
   [2022-07-04, 20:58:08 UTC] {taskinstance.py:1037} INFO - Dependencies not 
met for <TaskInstance: my_dag.one_of_dag_task 
scheduled__2022-07-03T12:00:00+00:00 [running]>, dependency 'Task Instance 
State' FAILED: Task is in the 'running' state.
   [2022-07-04, 20:58:08 UTC] {local_task_job.py:99} INFO - Task is not able to 
be run
   
   [2022-07-04, 20:58:08 UTC] {taskinstance.py:1037} INFO - Dependencies not 
met for <TaskInstance: my_dag.one_of_dag_task 
scheduled__2022-07-03T12:00:00+00:00 [running]>, dependency 'Task Instance Not 
Running' FAILED: Task is in the running state
   [2022-07-04, 20:58:08 UTC] {taskinstance.py:1037} INFO - Dependencies not 
met for <TaskInstance: my_dag.one_of_dag_task 
scheduled__2022-07-03T12:00:00+00:00 [running]>, dependency 'Task Instance 
State' FAILED: Task is in the 'running' state.
   [2022-07-04, 20:58:08 UTC] {local_task_job.py:99} INFO - Task is not able to 
be run
   [2022-07-04, 23:21:07 UTC] {spark.py:201} INFO - Driver 
application_1655985151726_32839 has finished
   [2022-07-04, 23:21:07 UTC] {taskinstance.py:1288} INFO - Marking task as 
SUCCESS. dag_id=my_dag, task_id=one_of_dag_task, 
execution_date=20220703T120000, start_date=20220704T145813, 
end_date=20220704T232107
   [2022-07-04, 23:21:07 UTC] {local_task_job.py:154} INFO - Task exited with 
return code 0
   [2022-07-04, 23:21:08 UTC] {local_task_job.py:264} INFO - 1 downstream tasks 
scheduled from follow-on schedule check
   ```
   
   One could argue: why am I making so much trouble if the task was eventually run successfully? Because sometimes the task simply never finishes and stays in the _running_ state indefinitely, which is hard to spot (until we get an alert because the DAG is badly delayed).
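
   As a rough sketch of how such stuck tasks could be flagged (assuming direct access to the metadata database; the 12-hour threshold is arbitrary and should sit above your longest legitimate task):

   ```python
   # Sketch: list task instances that have been "running" for more than 12 hours.
   # Run wherever the Airflow metadata DB is reachable (e.g. on the scheduler).
   from datetime import timedelta

   from airflow.models import TaskInstance
   from airflow.utils import timezone
   from airflow.utils.session import create_session
   from airflow.utils.state import State

   cutoff = timezone.utcnow() - timedelta(hours=12)

   with create_session() as session:
       stuck = (
           session.query(TaskInstance)
           .filter(TaskInstance.state == State.RUNNING, TaskInstance.start_date < cutoff)
           .all()
       )
       for ti in stuck:
           print(ti.dag_id, ti.task_id, ti.run_id, ti.start_date)
   ```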
   
   3. By digging through the logs, I was able to find the worker that picked up the task at **2022-07-04 14:58:05,058**:
   
   ```log
   prod worker [2022-07-04 14:58:05,058: INFO/MainProcess] Task 
airflow.executors.celery_executor.execute_command[661f0a53-129b-4856-8db0-5632ff466833]
 received
   prod worker [2022-07-04 14:58:05,069: INFO/ForkPoolWorker-828] Executing 
command in Celery: ['airflow', 'tasks', 'run', 'my_dag', 'one_of_dag_task', 
'scheduled__2022-07-03T12:00:00+00:00', '--local', '--subdir', 
'DAGS_FOLDER/my_dag/one_of_dag_task.py']
   prod worker [2022-07-04 14:58:05,929: WARNING/ForkPoolWorker-828] Running 
<TaskInstance: my_dag.one_of_dag_task scheduled__2022-07-03T12:00:00+00:00 
[queued]> on host airflow-worker-cfd5b7655-7hlcz
   ```
   
   And then no further log from that worker, saying for example that the task went from the `queued` to the `running` state.
   
   4. But then, later:
   
   ```log
   prod worker [2022-07-04 20:58:05,273: INFO/MainProcess] Task 
airflow.executors.celery_executor.execute_command[661f0a53-129b-4856-8db0-5632ff466833]
 received
   prod worker [2022-07-04 20:58:05,282: INFO/ForkPoolWorker-2634] Executing 
command in Celery: ['airflow', 'tasks', 'run', 'my_dag', 'one_of_dag_task', 
'scheduled__2022-07-03T12:00:00+00:00', '--local', '--subdir', 
'DAGS_FOLDER/my_dag/one_of_dag_task.py']
   prod worker [2022-07-04 20:58:05,282: INFO/ForkPoolWorker-2634] Celery task 
ID: 661f0a53-129b-4856-8db0-5632ff466833
   prod worker [2022-07-04 20:58:05,306: INFO/ForkPoolWorker-2634] Filling up 
the DagBag from /opt/airflow/dags/my_dag/one_of_dag_task.py
   prod worker [2022-07-04 20:58:06,023: WARNING/ForkPoolWorker-2634] Running 
<TaskInstance: my_dag.one_of_dag_task scheduled__2022-07-03T12:00:00+00:00 
[running]> on host data-platform-airflow-worker-cfd5b7655-68n2 
   ```
   
   5. And finally in the first worker, we see:
   ```log
   data-eks-prod worker [2022-07-04 23:21:09,542: INFO/ForkPoolWorker-828] Task 
airflow.executors.celery_executor.execute_command[661f0a53-129b-4856-8db0-5632ff466833]
 succeeded in 30184.482223104016s: None
   ```
   
   (_This is a very long task, basically waiting for a Spark job to terminate._)
   
   Here, it ended well, but sometimes there is simply nothing happening. How is it possible that no logs are generated?
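
   One detail that might matter: the second worker received the exact same Celery task id almost exactly six hours after the first delivery (14:58:05 vs 20:58:05), which, if I am not mistaken, matches the 21600 s `visibility_timeout` that Airflow sets by default for a Redis broker, so the broker may simply be re-delivering a task that is still running. A sketch of how one could raise it (the module name is hypothetical and the 12 h value is only an example sized above our longest task):

   ```python
   # custom_celery_config.py -- hypothetical module, referenced from airflow.cfg via
   #   [celery] celery_config_options = custom_celery_config.CELERY_CONFIG
   # Sketch assuming a Redis broker: reuse Airflow's default Celery config but raise
   # the visibility timeout so the broker does not re-deliver long-running tasks.
   from airflow.config_templates.default_celery import DEFAULT_CELERY_CONFIG

   CELERY_CONFIG = {
       **DEFAULT_CELERY_CONFIG,
       "broker_transport_options": {
           **DEFAULT_CELERY_CONFIG.get("broker_transport_options", {}),
           "visibility_timeout": 12 * 3600,  # seconds; example value only
       },
   }
   ```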
   
   I am still trying to get my head around Airflow's internals well enough to understand this issue, but I wanted to contribute what I have been able to find so far. Some more information about our architecture:
   
   **Environment:**
   
   - Cloud provider or hardware configuration: AWS
   - OS (e.g. from /etc/os-release): CentOS / RHEL / Fedora
   - Others: Redis, CeleryExecutor, Airflow 2.2.5
   
   

