Seems like a good idea: some kind of "task diagnosis" that gives users
more context when a task's state is not settled.

Happy to help on that one as well. I also think that a small AIP is
required, since the scope of the change could be substantial.
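
To make the idea concrete, here is a minimal sketch of what such a
"task diagnosis" could look like. All names, fields, and reason strings
below are hypothetical illustrations, not an existing Airflow API; the
real checks would have to query the metadata DB for the facts the
snapshot holds:

```python
# Hypothetical "task diagnosis" sketch -- the snapshot fields and reason
# strings are illustrative assumptions, not an existing Airflow API.
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TaskSnapshot:
    """A point-in-time view of the facts needed to explain a stuck task."""
    upstream_states: Dict[str, str] = field(default_factory=dict)  # task_id -> state
    pool_slots_free: int = 1
    active_runs: int = 0
    max_active_runs: int = 16
    depends_on_past: bool = False
    previous_ti_state: Optional[str] = "success"  # state of the same task in the prior run


def diagnose(snap: TaskSnapshot) -> List[str]:
    """Return human-readable reasons why the task may not be getting scheduled."""
    reasons: List[str] = []
    blocked = sorted(t for t, s in snap.upstream_states.items() if s != "success")
    if blocked:
        reasons.append("upstream tasks not successful: " + ", ".join(blocked))
    if snap.pool_slots_free <= 0:
        reasons.append("no free slots left in the task's pool")
    if snap.active_runs >= snap.max_active_runs:
        reasons.append("max_active_runs reached for the DAG")
    if snap.depends_on_past and snap.previous_ti_state != "success":
        reasons.append("depends_on_past is set and the previous run did not succeed")
    return reasons


if __name__ == "__main__":
    snap = TaskSnapshot(upstream_states={"extract": "failed"}, pool_slots_free=0)
    for reason in diagnose(snap):
        print("-", reason)
```

A diagnosis like this would only run on demand for a single task
instance, which fits the point made below in the thread about avoiding
upfront analysis of every task.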

Best regards,
Pierre

On Thu, Oct 19, 2023 at 5:05 PM Brent Bovenzi <br...@astronomer.io.invalid>
wrote:

> As Jarek said, some of these dependencies might take a lot of work to
> surface correctly. But I am happy to improve the grid and graph views to
> show more information, such as integrating rendered_templates and more
> details into the Grid view. Would you mind opening a GitHub issue for
> some of those smaller tasks so I don't forget to do them?
>
> I am also playing with some ways to better show datasets and other
> external dependencies in the grid/graph view.
>
> On Thu, Oct 19, 2023 at 10:48 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>
> > I think it will be tricky to surface to the user all of the reasons
> > why a task is not run, but surfacing them is indeed a good idea.
> > Currently this is only done by this FAQ entry listing possible reasons:
> > https://airflow.apache.org/docs/apache-airflow/stable/faq.html#why-is-task-not-getting-scheduled
> > - and I believe the list is no longer complete, given the number of
> > features implemented since the FAQ was written.
> >
> > The open question, I think (and I agree with Jens's comment that this
> > should be a small "AIP"-level change), is which of those reasons we
> > are able to detect deterministically. Part of the problem here (as
> > Jens also mentioned) is that in many cases the task in the DB is
> > simply skipped by the scheduler for some of the reasons explained in
> > the FAQ (and some not explained). Sometimes the task is not scheduled
> > simply because the scheduler has not yet had a chance to look at it,
> > for performance reasons. That's why I believe we really do not need a
> > new status, but rather a more automated analysis - in the "more
> > details" tab, when the user specifically asks for it - that could
> > give the user possible reasons for that particular task. This is much
> > better done at the individual task level, when the user asks "why is
> > this particular task not scheduled", because then you can query the
> > DB and figure it out. Recording and determining the information
> > upfront might not be possible for performance reasons: the scheduler
> > never really looks at all possible tasks (that would be prohibitively
> > expensive); instead it effectively finds a subset of "good candidates
> > to schedule", which is a much smaller set to run queries for.
> >
> > Some of that could be determined deterministically today, for example
> > "upstream tasks are still running". Some of it might be a little
> > "racy", though, because the system is continuously running: whatever
> > caused the task not to be scheduled in the scheduler's previous pass
> > might no longer apply (though there might still be other reasons). I
> > think the difficult cases might require additional information
> > recorded by the scheduler (for example, recording the fact that it
> > completed its last pass with dag runs still remaining to look at, or
> > that the number of tasks seen in the last pass reached the global
> > concurrency limits). Some of this might not even be possible for the
> > scheduler to determine without major query changes. For example, the
> > scheduler runs its query taking pool sizes into account, and the way
> > the pool query works, you simply select "pool size" eligible tasks,
> > with no idea whether more tasks were excluded from the result (nor
> > which tasks they were). This is where looking at individual tasks and
> > working "backwards" - guessing why - might be needed, though it could
> > possibly be helped by some extra information stored by the scheduler.
> >
> > I think we will not have a complete and fully accurate picture, but
> > we can make it better and better iteratively.
> >
> > J
> >
> >
> > On Mon, Oct 16, 2023 at 11:55 PM Oliveira, Niko
> > <oniko...@amazon.com.invalid>
> > wrote:
> >
> > > I really like this idea as well! One of the _most common_ questions
> > > I get from people managing an Airflow env is "Why is my task stuck
> > > in state X?". Anything we can do to make that more discoverable and
> > > user-friendly, especially in the UI instead of (or in addition to)
> > > the logs, would be fantastic!
> > >
> > > Thanks to Jens for having a think and pointing out a lot of the
> > > implications; I agree a quick AIP might be nice for this one.
> > >
> > > Cheers,
> > > Niko
> > >
> > > ________________________________
> > > From: Scheffler Jens (XC-DX/ETV5) <jens.scheff...@de.bosch.com
> .INVALID>
> > > Sent: Thursday, September 28, 2023 10:36:00 PM
> > > To: dev@airflow.apache.org
> > > Subject: RE: [EXTERNAL] [COURRIEL EXTERNE] The "no_status" state
> > >
> > > Hi Ryan,
> > >
> > > I really like the idea of exposing some more scheduler details.
> > > More transparency in scheduling, also in the UI, would help users
> > > (1) see and understand what is going on and (2) reduce the need to
> > > crawl through logs and raise support tickets when a status looks
> > > "strange". I often see this as a problem too; it also sometimes
> > > generates a bit of "mistrust" in the scheduler's stability.
> > >
> > > From the point of view of scheduler "overhead", I assume that as
> > > long as we are not doing a "full scan" just to ensure that each and
> > > every task is always up to date (today the scheduler stops
> > > processing once enough tasks have been processed in a loop, or when
> > > scheduling limits are reached), this is OK for me, and on the code
> > > side it does not seem to add much overhead.
> > > On the other hand, I fear that very many frequent DB updates would
> > > be needed, since another state would have to be written, meaning
> > > more DB round trips. This could hurt performance for large DAGs or
> > > deployments with many scheduled DAGs. So, at a minimum, the state
> > > should be written to the DB only when it changes, to keep the
> > > performance impact minimal.
> > >
> > > Regarding naming, I still think "no status" is good for indicating
> > > that the scheduler has not digested anything yet - maybe the task
> > > was never looked at because the scheduler is really stuck or too
> > > busy to get there. If the scheduler passes over a task and decides
> > > it is not ready to schedule, I would propose an additional state,
> > > e.g. "not_ready", in the state model between "none" and
> > > "scheduled".
> > >
> > > Finally, on the other hand, even with another state in the model, I
> > > am not sure this will 100% help in the use case you described. You
> > > might still need to scratch your head for a while when the UI shows
> > > a DAG is "stuck", until you realize all the options you have
> > > configured. Exposing a "why is it stuck" in a user-friendly manner
> > > might add another level of complexity in this case.
> > >
> > > As the state model might touch a lot of code and a longer
> > > discussion might be needed, would it make sense to raise an AIP for
> > > this? A lot of (external, provider?) dependencies might also need
> > > to adjust to the new state model.
> > >
> > > Mit freundlichen Grüßen / Best regards
> > >
> > > Jens Scheffler
> > >
> > > Deterministik open Loop (XC-DX/ETV5)
> > > Robert Bosch GmbH | Hessbruehlstraße 21 | 70565 Stuttgart-Vaihingen |
> > > GERMANY | www.bosch.com<http://www.bosch.com>
> > > Tel. +49 711 811-91508 | Mobil +49 160 90417410 |
> > > jens.scheff...@de.bosch.com<mailto:jens.scheff...@de.bosch.com>
> > >
> > > From: Ryan Hatter <ryan.hat...@astronomer.io.INVALID>
> > > Sent: Thursday, September 28, 2023 23:59
> > > To: dev@airflow.apache.org
> > > Subject: The "no_status" state
> > >
> > > Over the last couple of weeks I've come across a rather tricky
> > > problem a few times: one DAG run gets "stuck" in the queued state,
> > > while subsequent DAG runs get stuck running (screenshot below). One
> > > of these issues was caused by `max_active_runs` being met after a
> > > task instance from a previous DAG run was cleared, and one of the
> > > tasks had `depends_on_past=True`. This left the DAG run stuck in
> > > queued in perpetuity until we realized that the task that wasn't
> > > getting scheduled needed the failed task in the preceding DAG run
> > > to be re-run - which in turn left the subsequent DAG runs stuck in
> > > running. This caused quite a bit of confusion and stress.
> > >
> > > Given that Airflow is pretty burnt out on task instance states and
> > > colors, I propose replacing "no_status" with "dependencies_not_met"
> > > and surfacing dependencies in the grid view, instead of forcing
> > > users to already know where to look (i.e. the "more details" task
> > > instance details). Now that I've typed it out, I'm not sure there
> > > is a reason to keep the "more details" button at all, rather than
> > > laying out all of a task instance's details in the grid view,
> > > similar to how the graph and code views are now included there.
> > >
> > > Anyway, I wanted to solicit feedback before I open an issue / start
> work
> > > on this.
> > >
> > > [inline screenshot of the stuck DAG runs; image not included in the archive]
> > >
> >
>
