We saw the same thing: only a few truly active tasks, yet the task queue kept filling up with pending tasks.
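For reference, a rough sketch of the relevant airflow.cfg settings (the numbers are illustrative, not a recommendation; tune them for your own environment):

```
[core]
# Global cap on task instances running across the whole installation.
# Default is 32; we raised it well above our expected concurrency.
parallelism = 512

# Per-DAG cap on concurrently running task instances -- another limit
# that can be hit independently of parallelism.
dag_concurrency = 64

[scheduler]
# Heartbeat interval (seconds) used when supervising running jobs.
# Raising this from 5 to 30 is the change discussed further down the thread.
job_heartbeat_sec = 30
```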
Best,
Trent

On Tue, Aug 28, 2018 at 12:47 AM Vardan Gupta <vardangupta...@gmail.com> wrote:

> Hi Trent,
>
> Thanks for replying. You're suggesting that we might be hitting caps, but
> on our side there are hardly any concurrent tasks, rarely 1-2 at a time,
> with parallelism set to 50. But yeah, we'll increase the parallelism and
> see if that solves the problem.
>
> Thanks,
> Vardan Gupta
>
> On Tue, Aug 28, 2018 at 11:17 AM Trent Robbins <robbi...@gmail.com> wrote:
>
> > Hi Vardan,
> >
> > We had this issue - I recommend increasing the parallelism config
> > variable to something like 128 or 512. I have no idea what side effects
> > this could have; so far, none. This happened to us with LocalExecutor,
> > and our monitoring showed a clear issue with hitting a cap on the number
> > of concurrent tasks. I probably should have reported it, but we still
> > aren't sure what happened and have not investigated why those tasks are
> > not getting kicked back up into the queue.
> >
> > You may need to increase other config variables too, if they also cause
> > you to hit caps. Some people are conservative about these variables. If
> > you are feeling conservative, you can get better telemetry into this
> > with Prometheus and Grafana. We followed that route but resolved to just
> > set the cap very high and deal with any side effects afterwards.
> >
> > Best,
> > Trent
> >
> >
> > On Mon, Aug 27, 2018 at 21:09 vardangupta...@gmail.com <
> > vardangupta...@gmail.com> wrote:
> >
> > > Hi Everyone,
> > >
> > > For the last 2 weeks, we've been facing an issue with a LocalExecutor
> > > setup of Airflow v1.9 (MySQL as metastore): in a DAG where retry is
> > > configured and the initial try_number fails, then nearly 8 out of 10
> > > times the task gets stuck in the up_for_retry state; in fact, no
> > > running state appears after Scheduled > Queued in the task instance.
> > > The entry in the Job table is marked successful within a fraction of a
> > > second, a failed entry gets logged in the task_fail table without the
> > > task even reaching the operator code, and as a result we get an email
> > > alert saying:
> > >
> > > ```
> > > Try 2 out of 4
> > > Exception:
> > > Executor reports task instance %s finished (%s) although the task says
> > > its %s. Was the task killed externally?
> > > ```
> > >
> > > But when the default value of job_heartbeat_sec is changed from 5 to
> > > 30 seconds (
> > > https://groups.google.com/forum/#!topic/airbnb_airflow/hTXKFw2XFx0
> > > mentioned by Max some time back in 2016 for healthy supervision), the
> > > issue stops arising. We're still clueless as to how this new
> > > configuration actually solved/suppressed the issue; any key information
> > > around it would really help.
> > >
> > > Regards,
> > > Vardan Gupta
> > >
> > --
> > (Sent from cellphone)
>
--
(Sent from cellphone)