V0lantis commented on issue #10790:
URL: https://github.com/apache/airflow/issues/10790#issuecomment-1177296853
Hello here :wave:,
I would very much like to help, since we are experiencing the exact same issue
with the sensor in our Kubernetes cluster with the Celery executor. Just to add
our own experience:
We run the architecture with a shared NFS that hosts our DAGs and logs (among
other things). The speed and allowed throughput of the shared file system is a
huge bottleneck for the scheduler (since it needs to parse the DAGs quite
often). We noticed that the issue with the sensors and the log message
`INFO - Task is not able to run` appeared when we had used up all the credits
AWS allowed us. We therefore switched to a faster file system (FSx, for those
who are wondering), and the issue has almost disappeared, though it is still
there (which is why I would very much like to help).
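For context on how hard the scheduler hits that shared file system, this is the
kind of thing I run inside a scheduler pod to dump the parsing-related settings
(a minimal sketch; the option names are the standard `[scheduler]`/`[core]`
ones, and the comments only describe what I understand each knob to do):
```python
# Minimal sketch: print the settings that drive how often the scheduler
# re-reads the shared DAGs folder. Run inside any Airflow 2.2.x container.
from airflow.configuration import conf

for section, key in [
    ("scheduler", "min_file_process_interval"),  # seconds between re-parses of a DAG file
    ("scheduler", "dag_dir_list_interval"),      # seconds between listings of the DAGs folder
    ("scheduler", "parsing_processes"),          # parallel DAG-parsing processes
    ("core", "dagbag_import_timeout"),           # per-file import timeout
]:
    print(f"[{section}] {key} = {conf.get(section, key)}")
```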
But we are also experiencing a similar issue, though I am not exactly sure it
is the same one.
For example, sometimes a task is picked up by a Celery worker but never
produces any log file. Here is what I found:
1. A task is in a running state for a while without anything happening.
2. When we take a look at the log, we see the following:
```log
*** Reading remote log from
s3://my.bucket/airflow/prod/my_dag/one_of_dag_task/2022-07-03T12:00:00+00:00/1.log.
[2022-07-04, 20:58:08 UTC] {taskinstance.py:1037} INFO - Dependencies not
met for <TaskInstance: my_dag.one_of_dag_task
scheduled__2022-07-03T12:00:00+00:00 [running]>, dependency 'Task Instance Not
Running' FAILED: Task is in the running state
[2022-07-04, 20:58:08 UTC] {taskinstance.py:1037} INFO - Dependencies not
met for <TaskInstance: my_dag.one_of_dag_task
scheduled__2022-07-03T12:00:00+00:00 [running]>, dependency 'Task Instance
State' FAILED: Task is in the 'running' state.
[2022-07-04, 20:58:08 UTC] {local_task_job.py:99} INFO - Task is not able to
be run
[2022-07-04, 20:58:08 UTC] {taskinstance.py:1037} INFO - Dependencies not
met for <TaskInstance: my_dag.one_of_dag_task
scheduled__2022-07-03T12:00:00+00:00 [running]>, dependency 'Task Instance Not
Running' FAILED: Task is in the running state
[2022-07-04, 20:58:08 UTC] {taskinstance.py:1037} INFO - Dependencies not
met for <TaskInstance: my_dag.one_of_dag_task
scheduled__2022-07-03T12:00:00+00:00 [running]>, dependency 'Task Instance
State' FAILED: Task is in the 'running' state.
[2022-07-04, 20:58:08 UTC] {local_task_job.py:99} INFO - Task is not able to
be run
[2022-07-04, 23:21:07 UTC] {spark.py:201} INFO - Driver
application_1655985151726_32839 has finished
[2022-07-04, 23:21:07 UTC] {taskinstance.py:1288} INFO - Marking task as
SUCCESS. dag_id=my_dag, task_id=one_of_dag_task,
execution_date=20220703T120000, start_date=20220704T145813,
end_date=20220704T232107
[2022-07-04, 23:21:07 UTC] {local_task_job.py:154} INFO - Task exited with
return code 0
[2022-07-04, 23:21:08 UTC] {local_task_job.py:264} INFO - 1 downstream tasks
scheduled from follow-on schedule check
```
One could ask why I am making so much fuss if the task eventually ran
successfully. Because sometimes the task simply never finishes and stays in the
_running_ state indefinitely, which is hard to spot until an alert fires
because the DAG is badly delayed (there is a small detection sketch towards the
end of this comment).
3. By digging through the logs, I was able to find the worker that picked up
the task at **2022-07-04 14:58:05,058**:
```log
prod worker [2022-07-04 14:58:05,058: INFO/MainProcess] Task
airflow.executors.celery_executor.execute_command[661f0a53-129b-4856-8db0-5632ff466833]
received
prod worker [2022-07-04 14:58:05,069: INFO/ForkPoolWorker-828] Executing
command in Celery: ['airflow', 'tasks', 'run', 'my_dag', 'one_of_dag_task',
'scheduled__2022-07-03T12:00:00+00:00', '--local', '--subdir',
'DAGS_FOLDER/my_dag/one_of_dag_task.py']
prod worker [2022-07-04 14:58:05,929: WARNING/ForkPoolWorker-828] Running
<TaskInstance: my_dag.one_of_dag_task scheduled__2022-07-03T12:00:00+00:00
[queued]> on host airflow-worker-cfd5b7655-7hlcz
```
And then the worker produced no further log, for example one saying that the
task went from the `queued` to the `running` state.
4. But then, exactly six hours later, the same Celery task ID is received again
(more on that after the walkthrough):
```log
prod worker [2022-07-04 20:58:05,273: INFO/MainProcess] Task
airflow.executors.celery_executor.execute_command[661f0a53-129b-4856-8db0-5632ff466833]
received
prod worker [2022-07-04 20:58:05,282: INFO/ForkPoolWorker-2634] Executing
command in Celery: ['airflow', 'tasks', 'run', 'my_dag', 'one_of_dag_task',
'scheduled__2022-07-03T12:00:00+00:00', '--local', '--subdir',
'DAGS_FOLDER/my_dag/one_of_dag_task.py']
prod worker [2022-07-04 20:58:05,282: INFO/ForkPoolWorker-2634] Celery task
ID: 661f0a53-129b-4856-8db0-5632ff466833
prod worker [2022-07-04 20:58:05,306: INFO/ForkPoolWorker-2634] Filling up
the DagBag from /opt/airflow/dags/my_dag/one_of_dag_task.py
prod worker [2022-07-04 20:58:06,023: WARNING/ForkPoolWorker-2634] Running
<TaskInstance: my_dag.one_of_dag_task scheduled__2022-07-03T12:00:00+00:00
[running]> on host data-platform-airflow-worker-cfd5b7655-68n2
```
5. And finally, in the first worker's log, we see:
```log
data-eks-prod worker [2022-07-04 23:21:09,542: INFO/ForkPoolWorker-828] Task
airflow.executors.celery_executor.execute_command[661f0a53-129b-4856-8db0-5632ff466833]
succeeded in 30184.482223104016s: None
```
(_This is a very long task that basically waits for a Spark job to terminate._)
Here it ended well, but sometimes there is simply nothing happening at all. How
is it possible that no logs are generated?
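One detail that stands out from the worker logs above: the same Celery task ID
(`661f0a53-…`) is received at `14:58:05` and again at `20:58:05`, exactly six
hours apart. I do not want to jump to conclusions, but that would be consistent
with the broker re-delivering a task that was never acknowledged within its
visibility timeout, and it would also line up with the `Task is not able to be
run` lines at `20:58:08`: the second delivery finds the task instance already
in the `running` state and bails out. Purely as a hypothesis, this is how I am
checking the effective setting on our side (the six-hour default mentioned in
the comment is an assumption on my part):
```python
# Rough sketch (hypothesis only): inspect the broker transport options the
# Celery app is actually built with. If unacknowledged tasks are re-queued
# after `visibility_timeout`, a long-running task could be handed to a
# second worker, which is what the duplicated task ID above looks like.
from airflow.configuration import conf
from airflow.executors.celery_executor import app  # Airflow 2.2.x module path

print(app.conf.broker_transport_options)
# I believe Airflow's default for Redis is {'visibility_timeout': 21600},
# i.e. six hours; it can be overridden in airflow.cfg under
# [celery_broker_transport_options].
print(conf.get("celery_broker_transport_options", "visibility_timeout", fallback="<not set>"))
```
If that theory holds, raising `visibility_timeout` above the longest expected
task duration would at least avoid the duplicate delivery, but I would rather
understand the root cause first.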
I am still trying to get my head around Airflow's internals to understand this
issue, but I wanted to contribute what I have been able to find so far.
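Finally, since the worst case for us is a task instance that stays in `running`
forever without a live process behind it, this is the kind of check I am
experimenting with to spot them from the metadata database (a rough sketch
against the 2.2.x models; the 30-minute threshold is arbitrary):
```python
# Rough sketch: list task instances marked "running" whose local task job
# has not heartbeated recently (or that have no job attached at all).
from datetime import timedelta

from airflow.jobs.base_job import BaseJob  # Airflow 2.2.x module path
from airflow.models import TaskInstance
from airflow.utils import timezone
from airflow.utils.session import create_session
from airflow.utils.state import State

STALE_AFTER = timedelta(minutes=30)  # arbitrary threshold for this sketch

with create_session() as session:
    cutoff = timezone.utcnow() - STALE_AFTER
    rows = (
        session.query(TaskInstance, BaseJob)
        .outerjoin(BaseJob, TaskInstance.job_id == BaseJob.id)
        .filter(TaskInstance.state == State.RUNNING)
        .all()
    )
    for ti, job in rows:
        if job is None or job.latest_heartbeat is None or job.latest_heartbeat < cutoff:
            print(ti.dag_id, ti.task_id, ti.run_id, ti.start_date,
                  None if job is None else job.latest_heartbeat)
```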
Some more information about our architecture:
**Environment:**
- Cloud provider or hardware configuration: AWS
- OS (e.g. from /etc/os-release): CentOS / RHEL / Fedora
- Others: Redis, CeleryExecutor, Airflow 2.2.5