I have made PR https://github.com/apache/incubator-airflow/pull/2356 for this. The issue went a little bit deeper than I expected.

During backfills we can lose tasks that should be executed: when concurrency limits are reached, a task sets its own state to NONE, which puts it outside the scope the backfill is managing, so it is never executed. Several bugs cause this.

Firstly, the state reported by the executor was always success, i.e. the return code of the task instance was not propagated. Secondly, if the executor already has a task instance in its queue, it silently ignores that task instance being added again. The backfill did not guard against this, so tasks could get lost here as well.

This patch introduces CONCURRENCY_REACHED as an executor state, which is set if the task exits with EBUSY (16). This allows the backfill to handle these tasks properly and reschedule them. Please note that the CeleryExecutor does not report back on executor states.

Please test the patch and report back whether or not it solves the issue.

Bolke.

> On 8 Jun 2017, at 04:23, Russell Pierce <[email protected]> wrote:
>
> I hadn't thought of it that way. Given that SubDAGs are scheduled as
> backfills, then they'd inherit the same problem. So, the issue I had is
> version specific. Thanks for pointing that out Bolke. Do you know the
> relevant JIRA Issue off hand?
>
> On Wed, Jun 7, 2017, 4:28 PM Bolke de Bruin <[email protected]> wrote:
>
>> It is 1.8.x specific in this case (for backfills).
>>
>>> On 7 Jun 2017, at 21:35, Russell Pierce <[email protected]> wrote:
>>>
>>> Probably more of a configuration constellation issue than version specific
>>> or even an 'issue' per se. As noted, on restart the scheduler reschedules
>>> everything. I had a heavy SubDAG that when rescheduled could produce many
>>> extra tasks and a small fixed number of Celery workers. So, the scheduled
>>> tasks wouldn't be done by the time of the scheduler restart and then the
>>> scheduler would reschedule the SubDAG... debugging hilarity followed from
>>> there.
>>>
>>>> On Wed, Jun 7, 2017, 10:57 AM Jason Chen <[email protected]> wrote:
>>>>
>>>> I am using Airflow 1.7.1.3 with CeleryExecutor, but have not run into this
>>>> issue.
>>>> I am wondering if this issue is only for 1.8.x?
>>>>
>>>> On Wed, Jun 7, 2017 at 8:34 AM, Russell Pierce <[email protected]> wrote:
>>>>
>>>>> Depending on how fast you can clear down your queue, -n can be harmful and
>>>>> really stack up your celery queue. Keep an eye on your queue depth if you
>>>>> see a ton of messages about the task already having been run.
>>>>>
>>>>> On Mon, Jun 5, 2017, 9:18 AM Josef Samanek <[email protected]> wrote:
>>>>>
>>>>>> Hey. Thanks for the answer. I previously also tried to run scheduler -n
>>>>>> 10, but it was back when I was still using LocalExecutor. And it did not
>>>>>> help. I have not yet tried to do it with CeleryExecutor, so I might.
>>>>>>
>>>>>> Still, I would prefer to find an actual solution for the underlying
>>>>>> problem, not just a workaround (even though a working workaround is also
>>>>>> appreciated).
>>>>>>
>>>>>> Best regards,
>>>>>> Joe
>>>>>>
>>>>>> On 2017-06-02 00:10 (+0200), Alex Guziel <[email protected]> wrote:
>>>>>>> We've noticed this with celery, relating to this
>>>>>>> https://github.com/celery/celery/issues/3765
>>>>>>>
>>>>>>> We also use `-n 5` option on the scheduler so it restarts every 5 runs,
>>>>>>> which will reset all queued tasks.
>>>>>>>
>>>>>>> Best,
>>>>>>> Alex
>>>>>>>
>>>>>>> On Thu, Jun 1, 2017 at 2:18 PM, Josef Samanek <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi!
>>>>>>>>
>>>>>>>> We have a problem with our airflow. Sometimes, several tasks get queued
>>>>>>>> but they never get run and remain in Queued state forever. Other tasks from
>>>>>>>> the same schedule interval run. And next schedule interval runs normally
>>>>>>>> too. But these several tasks remain queued.
>>>>>>>>
>>>>>>>> We are using Airflow 1.8.1. Currently with CeleryExecutor and redis, but
>>>>>>>> we had the same problem with LocalExecutor as well (actually switching to
>>>>>>>> Celery helped quite a bit, the problem now happens way less often, but
>>>>>>>> still it happens). We have 18 DAGs total, 13 active. Some have just 1-2
>>>>>>>> tasks, but some are more complex, like 8 tasks or so and with upstreams.
>>>>>>>> There are also ExternalTaskSensor tasks used.
>>>>>>>>
>>>>>>>> I tried playing around with DAG configurations (limiting concurrency,
>>>>>>>> max_active_runs, ...), tried switching off some DAGs completely (not all
>>>>>>>> but most) etc., so far nothing helped. Right now, I am not really sure,
>>>>>>>> what else to try to identify and solve the issue.
>>>>>>>>
>>>>>>>> I am getting a bit desperate, so I would really appreciate any help with
>>>>>>>> this. Thank you all in advance!
>>>>>>>>
>>>>>>>> Joe
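
P.S. For anyone who wants the idea behind the patch without reading the PR, here is a rough sketch of the two guards described at the top of this message: propagating the task's return code into an executor state, and telling the caller when a task instance is already queued instead of silently ignoring it. This is not the code from the PR. Apart from CONCURRENCY_REACHED and the EBUSY exit code, every name here (the other state strings, state_from_return_code, run_and_report, SketchQueue) is made up for illustration.

```python
import errno
import subprocess

# Illustrative state names. Only CONCURRENCY_REACHED is named in this thread;
# the string values and the other two constants are assumptions.
SUCCESS = "success"
FAILED = "failed"
CONCURRENCY_REACHED = "concurrency_reached"


def state_from_return_code(return_code):
    """Translate a task process exit code into an executor state."""
    if return_code == 0:
        return SUCCESS
    if return_code == errno.EBUSY:
        # The task could not run because a concurrency limit was hit,
        # so the caller should reschedule it rather than mark it done.
        return CONCURRENCY_REACHED
    return FAILED


def run_and_report(command):
    """Run a command and report a state based on its actual exit code,
    instead of always reporting success."""
    return_code = subprocess.call(command)
    return state_from_return_code(return_code)


class SketchQueue(object):
    """Toy queue that reports duplicates to the caller."""

    def __init__(self):
        self._queued = {}

    def add(self, key, command):
        """Return False if the key is already queued, True otherwise."""
        if key in self._queued:
            return False
        self._queued[key] = command
        return True
```

With something like this, a backfill loop could put a task back on its to-run list when it sees CONCURRENCY_REACHED, or when add() returns False, rather than dropping it.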
