potiuk commented on issue #39717: URL: https://github.com/apache/airflow/issues/39717#issuecomment-2217269610
> (and maybe that enters your definition of "config specific").

Yes. The thing is "whatever makes it possible to reproduce the issue". If the maintainers are not able to reproduce the issue reliably, they can only make guesses, and even worse, even if they come up with a hypothesis and implement a supposed fix, they are not able to confirm or even confidently say that it fixes the issue. Having a minimal reproduction scenario is absolutely key to diagnosing and fixing the issue. It's just a fact of "life" - fixing a problem on some remote, inaccessible installation where you are not able to dig deeper and look around (and nobody pays for the time spent on it) is all about that. "Having a reproducible scenario" increases the chances of fixing the problem by at least a few orders of magnitude. That's simply how it works.

> Could you point to some statsd metrics that would help analyze the issue? (queue sizes, timeouts, something related to the scheduler load...)

No - nothing specific comes to my mind - the problem with such issues is that if we knew what exactly to look at, then we would **know** what the cause of the issue is - which we don't - so we try to get enough signals to be able to deduce it. As with everything in complex systems, there are no easy recipes to follow for diagnosing "rare" issues that are unexpected (the sheer fact that they are unexpected makes them difficult to diagnose, because you have no idea what to look at, and you need to look at the "whole" system and try to spot anomalies).

However, soon (likely in 2.10.0) there will be OpenTelemetry tracing integrated into Airflow's observability, and this will give much more detailed information on what's going on with each task. I'd strongly recommend integrating it into your observability stack - especially since a number of tools that use it will have an option to export such traces and make them available to someone other than those who manage the installation, to be able to dig deeper.
Until then we can mostly say "dig deeper". There is a monthly town hall tomorrow https://www.linkedin.com/feed/update/urn:li:activity:7216205556301090817 where @howardyoo is going to talk about it.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
