[ https://issues.apache.org/jira/browse/AIRFLOW-5071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17193576#comment-17193576 ]

ASF GitHub Bot commented on AIRFLOW-5071:
-----------------------------------------

sgrzemski-ias edited a comment on issue #10790:
URL: https://github.com/apache/airflow/issues/10790#issuecomment-690223705


   @yuqian90 
   > I also tried to tweak these parameters. They don't seem to matter much as 
far as this error is concerned:
   > 
   > ```
   > parallelism = 1024
   > dag_concurrency = 128
   > max_threads = 8
   > ```
   > 
   > The way to reproduce this issue seems to be to create a DAG with a bunch 
of parallel `reschedule` sensors. And make the DAG slow to import. For example, 
like this. If we add a `time.sleep(30)` at the end to simulate the experience 
of slow-to-import DAGs, this error happens a lot for such sensors. You may also 
need to tweak the `dagbag_import_timeout` and `dag_file_processor_timeout` if 
adding the `sleep` causes dags to fail to import altogether.
   
   Those parameters won't help you much. I was struggling to work around 
this issue somehow, and I believe I've found the right solution now. In my 
case the biggest hint while debugging was not the scheduler/worker logs but 
the Celery Flower web UI. We have a setup of 3 Celery workers, 4 CPUs each. 
It often happened that Celery was running 8 or more Python reschedule 
sensors on one worker but 0 on the others, and that was exactly when the 
sensors started to fail. Two Celery settings are responsible for this 
behavior: ```worker_concurrency``` with a default value of "16" and 
```worker_autoscale``` with a default value of "16,12" (meaning the minimum 
number of Celery processes on a worker is 12, scalable up to 16). With 
those left at their defaults, Celery was configured to load up to 16 tasks 
(mainly reschedule sensors) onto a single node. After setting 
```worker_concurrency``` to match the CPU count and ```worker_autoscale``` 
to "4,2", the problem is **literally gone**. Maybe that is another clue for 
@turbaszek.
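
   For reference, a sketch of how those values map onto the `[celery]` 
section of `airflow.cfg` (assuming a stock config layout and 4-CPU worker 
nodes as described above; the exact numbers should match your own CPU 
count):

   ```ini
   [celery]
   # Cap each worker at roughly one Celery process per CPU (4-CPU nodes here)
   worker_concurrency = 4

   # Autoscale between 2 and 4 processes instead of the "16,12" default.
   # The format is "max,min"; when set, autoscaling takes precedence over
   # the fixed worker_concurrency value.
   worker_autoscale = 4,2
   ```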
   
   I've also been trying hard to set up a local docker-compose file with the 
scheduler, webserver, Flower, Postgres and RabbitMQ as the Celery backend, 
but I was not able to replicate the issue either. I tried starting a worker 
container with limited CPU to imitate this situation, but that failed too. 
Tasks do in fact get killed and shown as failed in Celery Flower, but not 
with the "killed externally" reason. 
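
   In case anyone wants to retry the CPU-limited reproduction, this is the 
kind of constraint I mean (a minimal sketch; the service name, image tag 
and limits are placeholders, and `cpus`/`mem_limit` require Compose file 
format 2.2+, while Swarm-mode stacks use `deploy.resources.limits` instead):

   ```yaml
   version: "2.4"
   services:
     worker:
       image: apache/airflow:1.10.12   # placeholder image/tag
       command: worker
       cpus: 0.5        # starve the worker to imitate an overloaded node
       mem_limit: 2g
   ```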


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


> Thousands of Executor reports task instance X finished (success) although the 
> task says its queued. Was the task killed externally?
> ----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: AIRFLOW-5071
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5071
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: DAG, scheduler
>    Affects Versions: 1.10.3
>            Reporter: msempere
>            Priority: Critical
>             Fix For: 1.10.12
>
>         Attachments: image-2020-01-27-18-10-29-124.png, 
> image-2020-07-08-07-58-42-972.png
>
>
> I'm opening this issue because since I updated to 1.10.3 I'm seeing thousands 
> of daily messages like the following in the logs:
>  
> ```
>  {{__init__.py:1580}} ERROR - Executor reports task instance <TaskInstance: X 
> 2019-07-29 00:00:00+00:00 [queued]> finished (success) although the task says 
> its queued. Was the task killed externally?
> {{jobs.py:1484}} ERROR - Executor reports task instance <TaskInstance: X 
> 2019-07-29 00:00:00+00:00 [queued]> finished (success) although the task says 
> its queued. Was the task killed externally?
> ```
> -And it looks like this is also triggering thousands of daily emails because 
> the flag to send email in case of failure is set to True.-
> I have Airflow setup to use Celery and Redis as a backend queue service.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
