potiuk commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2217269610

   >  (and maybe that enters your definition of "config specific").
   
   Yes. The thing is "whatever makes it possible to reproduce the issue". If 
the maintainers are not able to reproduce the issue reliably, they can only 
make guesses, and even worse, even if they come up with a hypothesis 
and implement a supposed fix, they are not able to confirm, or even confidently 
say, that it fixes the issue. Having a minimal reproduction scenario is 
absolutely key to diagnosing and fixing the issue. It's just a fact of "life": 
fixing a problem in some remote, inaccessible installation where you are not 
able to dig deeper and look around (and nobody pays for the time spent on it) 
is all about that. Having a reproducible scenario increases the chances of 
fixing the problem by at least a few orders of magnitude. That's simply how it 
works.
   
   > Could you point to some statsd metrics that would help analyze the issue? 
(queue sizes, timeouts, something related to the scheduler load...)
   
   No, nothing specific comes to mind. The problem with such issues is 
that if we knew exactly what to look at, then we would **know** what the cause 
of the issue is. We don't, so we try to gather enough signals to be able 
to deduce it. Like with everything in complex systems, there are no easy 
recipes to follow for diagnosing "rare" issues that are unexpected. The sheer 
fact that they are unexpected makes them difficult to diagnose, because you 
have no idea what to look at and you need to look at the "whole" system and 
try to spot anomalies.
   
   However, soon (likely in 2.10.0) there will be OpenTelemetry tracing 
integrated into Airflow's observability, and it will give much more 
detailed information on what's going on with each task. I'd strongly 
recommend integrating it into your observability stack, especially since a 
number of tools that use it will have an option to export such traces and 
make them available to someone other than those who manage the installation, 
so that they can dig deeper. Until then we can mostly say "dig deeper".
   
   There is a monthly town hall tomorrow 
https://www.linkedin.com/feed/update/urn:li:activity:7216205556301090817 where 
@howardyoo is going to talk about it.
   

