GitHub user jradhima added a comment to the discussion: Need help understanding total number of Dags oscillating on UI
Hello all, we recently encountered such an issue in our setup. Briefly:

- deployment on k8s using the official helm chart, 4+ schedulers, 4K+ DAGs
- parsing takes approx. 25 minutes, with `min_file_process_interval` set to 50 minutes
- after moving to a standalone dag processor, DAGs started disappearing from and reappearing in the UI

After reading this post, we looked into the interaction between the scheduler and the dag processor and found the cause:

- the parser parses every 50 minutes
- `dag_stale_not_seen_duration` is left at its default of 10 minutes
- so the scheduler removes all DAGs 10 minutes after the processor parses them

Our solution for the moment is to set `dag_stale_not_seen_duration` equal to `min_file_process_interval` + 1h (the arithmetic is sketched at the bottom of this comment). However, we do have this question: the same setup causes no issues when the schedulers themselves parse the DAGs. Is this a "bug"/unforeseen behaviour of the setup, or is it considered standard behaviour that all users should be familiar with, working as expected?

To generalise the above: after looking into the source code and the logs more closely, we realised that both the scheduler and the dag processor are responsible for removing "stale" DAGs, each in its own way (a simplified sketch is at the bottom of this comment). My opinion is that this can cause confusion, because when the parsing of DAGs is moved into a standalone component, it makes sense to move everything regarding DAGs, serialised DAGs and their management into that component as well.

---

Tangential to the above, allow me to pose another question: to what extent does running multiple dag processing managers result in duplicate work, and under what circumstances? I've seen many logs and indications that the processing managers work independently; however, it doesn't make sense that they don't communicate with the database at all, so there has to be some interaction through the "last_parsed" attribute of the `serialized_dag` table. My current understanding (a toy model of it is sketched at the bottom of this comment) is:

- each processing manager will build a file queue using the last modified time (from the db), checked against `min_file_process_interval`
- until that queue is cleared, the manager does not communicate with the database again; it does not check the db on a per-dag or per-file basis
- it bulk-updates the database with the parsing results at the end
- this means that if a file has been picked up by one processing manager, it remains "open" for the others until that manager bulk-syncs

I am sure this is not exactly how the process works, but it should contain elements of truth. I know that processors do *some* duplicate work; what I am unsure about is whether they *always* do. I think this area is lacking in the docs and I would be open to helping clarify it, however I would first need to understand it completely myself!
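---

To make the above concrete, a few sketches. First, the timing arithmetic behind the oscillation, using the numbers from our setup. This is back-of-the-envelope only; the exact parse scheduling is more nuanced, but it shows why the default 10-minute threshold cannot keep up with a 50-minute parse interval:

```python
from datetime import timedelta

# Numbers from the setup described above. In airflow.cfg both options live
# under [scheduler] and are given in seconds.
min_file_process_interval = timedelta(minutes=50)    # [scheduler] min_file_process_interval
parse_duration = timedelta(minutes=25)               # observed length of one parse loop
dag_stale_not_seen_duration = timedelta(minutes=10)  # [scheduler] dag_stale_not_seen_duration (default 600s)

# A DAG's last-parsed timestamp is refreshed only when its file is parsed,
# so the gap between refreshes can be roughly the process interval plus the
# time the parse loop itself takes:
worst_case_gap = min_file_process_interval + parse_duration  # ~75 minutes

# Under the default 10-minute threshold, the DAG therefore looks "not seen"
# for most of every cycle, and the scheduler deactivates it:
assert dag_stale_not_seen_duration < worst_case_gap

# Our workaround: make the threshold comfortably larger than the worst-case gap.
safe_threshold = min_file_process_interval + timedelta(hours=1)
assert safe_threshold > worst_case_gap
```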
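Second, the staleness check itself. This is a deliberately simplified paraphrase of the behaviour we observed, not Airflow's actual code; `DagRecord` and `deactivate_stale_dags` are invented names:

```python
from datetime import datetime, timedelta, timezone

# Invented stand-ins for the dag table and the scheduler-side cleanup.
class DagRecord:
    def __init__(self, dag_id: str, last_parsed_time: datetime, is_active: bool = True):
        self.dag_id = dag_id
        self.last_parsed_time = last_parsed_time
        self.is_active = is_active

def deactivate_stale_dags(dags: list[DagRecord], stale_not_seen: timedelta) -> list[str]:
    """Deactivate DAGs whose last parse is older than the staleness threshold.

    Note what is *not* checked: whether the next parse is even due yet. The
    threshold is compared against wall-clock age alone, which is why it must
    be larger than the real parse interval.
    """
    now = datetime.now(timezone.utc)
    deactivated = []
    for dag in dags:
        if dag.is_active and now - dag.last_parsed_time > stale_not_seen:
            dag.is_active = False  # the DAG vanishes from the UI
            deactivated.append(dag.dag_id)
    return deactivated

# A DAG parsed 20 minutes ago is already "stale" under the 10-minute default,
# even though with a 50-minute interval its next parse is 30 minutes away.
dags = [DagRecord("etl_daily", datetime.now(timezone.utc) - timedelta(minutes=20))]
print(deactivate_stale_dags(dags, stale_not_seen=timedelta(minutes=10)))  # ['etl_daily']
```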
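Finally, a toy model of the multi-manager hypothesis from the list above. Everything here (the names, the four steps) is assumption, meant only to make the question precise, not a description of the real manager:

```python
import time

# Purely illustrative; mirrors the hypothesised steps, not Airflow's code.
MIN_FILE_PROCESS_INTERVAL = 50 * 60  # seconds

def build_file_queue(db_last_parsed: dict[str, float]) -> list[str]:
    """Step 1: queue every file whose recorded last parse is old enough."""
    now = time.time()
    return [path for path, ts in db_last_parsed.items()
            if now - ts >= MIN_FILE_PROCESS_INTERVAL]

def run_manager_cycle(db_last_parsed: dict[str, float]) -> None:
    queue = build_file_queue(db_last_parsed)
    results = {}
    # Steps 2-3: the queue is drained without consulting the DB again, so a
    # second manager building its own queue meanwhile sees the same files.
    for path in queue:
        results[path] = f"parsed {path}"  # stand-in for actually parsing the file
    # Step 4: one bulk sync at the end; only now is the work visible to others.
    now = time.time()
    for path in results:
        db_last_parsed[path] = now

# Two managers whose cycles overlap would both queue the stale file:
db = {"dags/etl.py": time.time() - 2 * MIN_FILE_PROCESS_INTERVAL}
print(build_file_queue(db))  # ['dags/etl.py'] for every manager asking now
run_manager_cycle(db)
print(build_file_queue(db))  # [] once the bulk sync has landed
```

If this model is even roughly right, duplicate work would happen whenever two managers' cycles overlap, which would answer the "always vs. sometimes" question above.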
---

GitHub link: https://github.com/apache/airflow/discussions/44495#discussioncomment-12229194