Re: The "no_status" state

Jarek Potiuk Thu, 19 Oct 2023 07:48:15 -0700

I think it will be tricky to surface to the user all the reasons why a task
is not run. But surfacing them is indeed a good idea. Currently this is only
done by this FAQ entry, which lists possible reasons:
https://airflow.apache.org/docs/apache-airflow/stable/faq.html#why-is-task-not-getting-scheduled
- and I believe the list is no longer complete, given the number of
features implemented since the FAQ was written.


The open question, I think (and I agree with Jens' comment that this should
be a small "AIP"-level discussion), is which of those reasons we are able to
detect deterministically. A bit of a problem here (as Jens also mentioned) is
that in many cases the task in the DB is simply skipped by the scheduler
because of some of the reasons explained in the FAQ (and some not explained).
Sometimes the task is not scheduled simply because the scheduler has not yet
had a chance to look at it, for performance reasons. That's why I believe we
do not really need a new status, but rather more automated analysis - in the
"more details" tab, when the user specifically asks for it. That could give
the user the possible reasons for this particular task. It would be much
better to do this at the "individual" task level, when the user asks "why is
this particular task not scheduled?" - because then you can query the DB and
figure it out. Recording and determining the information upfront might not be
possible for performance reasons, simply because the scheduler never really
looks at all possible tasks (that would be prohibitively expensive). Instead
it effectively finds a subset of "good candidates to schedule", which is a
much smaller set to run queries for.
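
A rough sketch of what such an on-demand, per-task check could look like.
All names here (TaskSnapshot, possible_reasons and the fields) are made up
for illustration; this is not Airflow's actual model or API, and it assumes
the simplest case where every upstream task must succeed:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TaskSnapshot:
    """Hypothetical, simplified view of one task instance as read from the DB."""
    task_id: str
    # upstream task_id -> state; assumes the task needs all upstreams to succeed
    upstream_states: Dict[str, str] = field(default_factory=dict)
    depends_on_past: bool = False
    # state of the same task in the previous DAG run (None if there was none)
    previous_run_state: Optional[str] = None
    pool_open_slots: int = 1
    active_dag_runs: int = 0
    max_active_runs: int = 16


def possible_reasons(task: TaskSnapshot) -> List[str]:
    """Work "backwards" from one task and list what might be blocking it."""
    reasons = []
    unfinished = sorted(
        t for t, state in task.upstream_states.items() if state != "success"
    )
    if unfinished:
        reasons.append("upstream tasks not successful: " + ", ".join(unfinished))
    if task.depends_on_past and task.previous_run_state not in (None, "success"):
        reasons.append(
            "depends_on_past is set and the previous run's task is "
            + str(task.previous_run_state)
        )
    if task.pool_open_slots <= 0:
        reasons.append("pool has no open slots")
    if task.active_dag_runs >= task.max_active_runs:
        reasons.append("max_active_runs reached for the DAG")
    if not reasons:
        # Nothing obvious: the scheduler may simply not have looked at it yet.
        reasons.append("no blocking condition found; scheduler may not have examined this task yet")
    return reasons
```

The answer is only a list of "possible" reasons computed at the moment the
user asks, so it costs nothing during normal scheduling.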

Some of that could be determined deterministically today, for example
"upstream tasks are still running". Some of it might be a little "racy"
though: the system is running continuously, so what caused the task not to
be scheduled in the previous scheduler pass might no longer be valid (but
there might still be other reasons). I think the difficult cases might
require additional information recorded by the scheduler (for example the
scheduler recording the fact that it completed the last pass with dag runs
still remaining to look at, or the fact that the number of tasks seen in the
last pass reached the global concurrency limits). But some of this might not
even be possible for the scheduler to determine without major query changes.
For example, the scheduler runs a query that includes pool sizes: the way
the pool query is done, you simply select "pool size" eligible tasks, and
you have no idea whether more tasks were excluded from the result (nor which
tasks they were). This is where looking at individual tasks and working
"backwards" - guessing why - might be needed. But possibly it could be helped
by some extra information stored by the scheduler.
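
To illustrate the pool problem with a toy example (plain sqlite3 and a
made-up table, not the real scheduler query): a LIMIT-ed query returns only
the tasks that fit. Fetching one extra row is one possible, cheap way to at
least learn that something was left out, though it still does not tell you
the full set of excluded tasks:

```python
import sqlite3

# Toy table standing in for task instances; the schema is an assumption
# made for this example only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_instance (task_id TEXT, pool TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO task_instance VALUES (?, ?, 'scheduled')",
    [("task_%d" % i, "default_pool") for i in range(5)],
)


def pick_candidates(conn, pool, open_slots):
    """Select up to `open_slots` tasks and report whether more were waiting."""
    rows = conn.execute(
        "SELECT task_id FROM task_instance "
        "WHERE pool = ? AND state = 'scheduled' "
        "ORDER BY task_id LIMIT ?",
        (pool, open_slots + 1),  # one extra row only to detect overflow
    ).fetchall()
    picked = [row[0] for row in rows[:open_slots]]
    more_waiting = len(rows) > open_slots
    return picked, more_waiting
```

A flag like more_waiting is the kind of extra information the scheduler
could record per pass, so that a later per-task analysis can say "the pool
was full in the last pass" instead of guessing.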

I think we will not have a complete and fully accurate picture, but I think
iteratively we could get this better and better.

J


On Mon, Oct 16, 2023 at 11:55 PM Oliveira, Niko <oniko...@amazon.com.invalid>
wrote:

> I really like this idea as well! One of the _the most common_ questions I
> get from people managing an Airflow env is "Why is my task stuck in state
> X". Anything we can do to make that more discoverable and user friendly,
> especially in the UI instead of (or in addition to) logs would be fantastic!
>
> Thanks to Jens for having a think and pointing out a lot of the
> implications, I agree a quick AIP might be nice for this one.
>
> Cheers,
> Niko
>
> ________________________________
> From: Scheffler Jens (XC-DX/ETV5) <jens.scheff...@de.bosch.com.INVALID>
> Sent: Thursday, September 28, 2023 10:36:00 PM
> To: dev@airflow.apache.org
> Subject: RE: The "no_status" state
>
> Hi Ryan,
>
> I really like the idea of exposing some more scheduler details. More
> transparency in scheduling, also in the UI, would (1) help the user see
> and understand what is going on and (2) reduce the need to crawl through
> logs and raise support tickets if the status is “strange”. I often see
> this as a problem too. It also sometimes generates a bit of “mistrust”
> in the scheduler's stability.
>
> From the point of view of scheduler “overhead”, I assume that as long as
> we are not making a “full scan” just to ensure that each and every task is
> always up to date (today the scheduler stops processing after enough tasks
> have been processed in a loop or when scheduling limits are reached), this
> is OK for me, and on the code side it does not seem to be much overhead.
> On the other hand, I am a bit afraid that very frequent updates would need
> to happen on the DB, as another state would need to be written, so more
> DB round trips would be needed. This might hit performance for large DAGs or
> cases where DAGs are scheduled. So at least it would need to filter and
> update the state in the DB only if it changed, to keep the performance
> impact minimal.
>
> On naming, I still think “no status” is good to indicate that the
> scheduler did not digest anything; maybe the task was never looked at
> because the scheduler is actually stuck or too busy to get there. I would
> propose that, if the scheduler passes over a task and decides that it is
> not ready to schedule, there be an additional state called e.g. “not_ready”
> in the state model between “none” and “scheduled”.
>
> Finally, on the other hand, I am not sure whether adding another state to
> the model will 100% help in the use case you describe. You might still
> need to scratch your head for a while when the UI shows that a DAG is
> “stuck”, until you realize all the options you have configured. Exposing a
> “why is it stuck” explanation in a user-friendly manner might be another
> level of complexity in this case.
>
> As the state model might touch a lot of code and a longer discussion might
> be needed, would we need to raise an AIP for this? There might be a lot
> more (external, provider??) dependencies when adjusting the state model.
>
> Best regards
>
> Jens Scheffler
> 
> From: Ryan Hatter <ryan.hat...@astronomer.io.INVALID>
> Sent: Thursday, September 28, 2023 23:59
> To: dev@airflow.apache.org
> Subject: The "no_status" state
>
> Over the last couple of weeks I've come across a rather tricky problem a few
> times. One DAG run gets "stuck" in the queued state, while subsequent DAG
> runs are stuck running (screenshot below). One of these issues was
> caused by `max_active_runs` being met when a task instance from a
> previous DAG run was cleared, and one of the tasks had
> `depends_on_past=True`. This caused the DAG run to be stuck in queued in
> perpetuity until it was realized that the task that wasn't getting
> scheduled needed the failed task in the preceding DAG run to be re-run,
> which in turn is what kept the other DAG runs stuck in running.
> All of this caused quite a bit of confusion and stress.
>
> Given that Airflow is pretty burnt out on task instance states and colors,
> I propose replacing "no_status" with "dependencies_not_met" and surfacing
> dependencies in the grid view instead of forcing users to already know
> where to look (i.e. "more details" task instance details). Now that I typed
> it out, I'm not sure there should be a reason for the "more details" button
> and not just laying out all of a task instance's details in the grid view
> similar to how the graph and code views are now included in the grid view.
>
> Anyway, I wanted to solicit feedback before I open an issue / start work
> on this.
>
> [screenshot not preserved in the archive]
>
