Hi Trent,

Thanks for replying. You're suggesting we might be hitting a cap, but on our
side there are hardly any concurrent tasks, rarely 1-2 at a time, with
parallelism set to 50. Still, we'll try increasing the parallelism and see
whether that solves the problem.
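
For anyone following along, the two settings in question live in airflow.cfg
roughly as below. This is just a minimal sketch against the stock Airflow 1.9
layout; 128 is only Trent's suggested value, and 30 is the job_heartbeat_sec
from the thread linked further down, not a number we've validated beyond our
own testing.

```
[core]
# Maximum number of task instances that can run concurrently on this
# Airflow installation (default 32). We currently use 50; Trent suggests
# 128 or 512.
parallelism = 128

[scheduler]
# Job heartbeat interval in seconds (default 5); raising it to 30 is what
# made the stuck up_for_retry tasks stop appearing for us.
job_heartbeat_sec = 30
```

If editing the file isn't convenient, the same keys can, as far as I know,
also be overridden through environment variables of the form
AIRFLOW__CORE__PARALLELISM and AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC.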

Thanks,
Vardan Gupta

On Tue, Aug 28, 2018 at 11:17 AM Trent Robbins <robbi...@gmail.com> wrote:

> Hi Vardan,
>
> We had this issue - I recommend increasing the parallelism config variable
> to something like 128 or 512. I have no idea what side effects this could
> have; so far, none. This happened to us with LocalExecutor, and our
> monitoring showed a clear issue with hitting a cap on the number of
> concurrent tasks. I probably should have reported it, but we still aren't
> sure what happened and have not investigated why those tasks are not
> getting kicked back into the queue.
>
> You may need to increase other config variables too, if they also cause
> you to hit caps. Some people are conservative about these variables; if
> you are feeling conservative, you can get better telemetry into this with
> Prometheus and Grafana. We followed that route but ultimately resolved to
> set the cap very high and deal with any side effects afterwards.
>
> Best,
> Trent
>
>
> On Mon, Aug 27, 2018 at 21:09 vardangupta...@gmail.com <
> vardangupta...@gmail.com> wrote:
>
> > Hi Everyone,
> >
> > For the last 2 weeks, we've been facing an issue with a LocalExecutor
> > setup of Airflow v1.9 (MySQL as metastore): in a DAG where retry has
> > been configured and the initial try_number fails, nearly 8 out of 10
> > times the task gets stuck in the up_for_retry state; in fact, no
> > running state ever follows Scheduled > Queued in the TI. The entry in
> > the Job table is marked successful within a fraction of a second, a
> > failed entry gets logged in the task_fail table without the task ever
> > reaching the operator code, and as a result we get an email alert
> > saying
> >
> > ```
> > Try 2 out of 4
> > Exception:
> > Executor reports task instance %s finished (%s) although the task says
> > its %s. Was the task killed externally?
> > ```
> >
> > But when the default value of job_heartbeat_sec is changed from 5 to 30
> > seconds (https://groups.google.com/forum/#!topic/airbnb_airflow/hTXKFw2XFx0,
> > mentioned by Max some time back in 2016 for healthy supervision), the
> > issue stops occurring. We're still clueless about how this new
> > configuration actually solved/suppressed the issue; any key information
> > around it would really help here.
> >
> > Regards,
> > Vardan Gupta
> >
> --
> (Sent from cellphone)
