scaoupgrade commented on issue #39717: URL: https://github.com/apache/airflow/issues/39717#issuecomment-2269606223
I have been following this thread because we also hit this issue on Airflow `2.8.4`. We have been running this version for over two months, and this is the first time I have seen the error, which may suggest that it occurs less often on `2.8.X`.

I see two issues being discussed in this thread:

1. The Airflow scheduler complains: `Executor reports task instance <TaskInstance: (...)> finished (failed) although the task says it's queued. (Info: None) Was the task killed externally?`
2. The Airflow worker throws: `airflow.exceptions.AirflowException: Celery command failed on host: xxxx with celery_task_id xxxxx`

Based on the logs from the day the issue happened, these two are not the same issue. Issue 2 happens frequently: I see about 1600 such errors every day, and that daily count is stable.

Thanks @potiuk for providing a fix. https://github.com/apache/airflow/pull/41260/files could address issue 2, but issue 1 should be something else: on the day the incident happened on our platform, I saw a burst of messages like `Executor reports task instance <TaskInstance: (...)> finished (failed) although the task says it's queued. (Info: None) Was the task killed externally?`, while the number of "Celery command failed" errors in the worker log stayed stable (around 1600 messages).
Looking at the scheduler log from when the issue happened, I noticed this pattern repeated multiple times for the same task in a given DAG:

```
{"log":"\t<TaskInstance: xxxxx scheduled__2024-07-30T20:43:00+00:00 [scheduled]>","stream":"stdout","timestamp":1722380423388}
{"log":"\t<TaskInstance: xxxxx scheduled__2024-07-30T20:43:00+00:00 [scheduled]>","stream":"stdout","timestamp":1722380423388}
{"log":"\t<TaskInstance: xxxxx scheduled__2024-07-30T20:43:00+00:00 [scheduled]>","stream":"stdout","timestamp":1722380423388}
{"log":"\t<TaskInstance: xxxxx scheduled__2024-07-30T20:43:00+00:00 [scheduled]>","stream":"stdout","timestamp":1722380423388}
{"log":"\t<TaskInstance: xxxxx scheduled__2024-07-30T20:43:00+00:00 [scheduled]>","stream":"stdout","timestamp":1722380423388}
{"log":"\t<TaskInstance: xxxxx scheduled__2024-07-30T20:43:00+00:00 [scheduled]>","stream":"stdout","timestamp":1722380423388}
```

The same line is repeated hundreds of times for each task in that DAG, which seems abnormal. It looks like the scheduler's DAG processor runs into some problem and something fails during the scheduling phase.
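For anyone who wants to check their own scheduler logs for the same repetition, here is a minimal sketch in Python. It assumes the log is available as one JSON object per line with `log` and `timestamp` fields, as in the excerpt above; the function name `count_repeated_task_instances` and the `threshold` parameter are hypothetical and not part of Airflow.

```python
import json
import re
from collections import Counter

# Matches the "<TaskInstance: ... [state]>" fragment seen in the scheduler log excerpt.
TI_PATTERN = re.compile(r"<TaskInstance: (?P<ti>.+?) \[(?P<state>\w+)\]>")


def count_repeated_task_instances(lines, threshold=2):
    """Count identical (task instance, state, timestamp) entries.

    Returns a dict containing only the entries seen at least `threshold` times.
    Assumes each line is a JSON object with "log" and "timestamp" keys
    (an assumption based on the excerpt, not a documented Airflow format).
    """
    counts = Counter()
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON records
        if not isinstance(record, dict):
            continue  # skip JSON values that are not objects
        match = TI_PATTERN.search(str(record.get("log", "")))
        if match:
            key = (match["ti"], match["state"], record.get("timestamp"))
            counts[key] += 1
    return {key: n for key, n in counts.items() if n >= threshold}


if __name__ == "__main__":
    # Raw string so that "\t" reaches the JSON parser as a JSON escape.
    sample_line = (
        r'{"log":"\t<TaskInstance: xxxxx scheduled__2024-07-30T20:43:00+00:00 '
        r'[scheduled]>","stream":"stdout","timestamp":1722380423388}'
    )
    for key, n in count_repeated_task_instances([sample_line] * 6).items():
        print(n, key)
```

Grouping by timestamp as well as task instance separates the case reported here (the same line emitted many times at the same instant) from a task that is simply examined again on later scheduler loops.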
