GitHub user jradhima added a comment to the discussion: Need help understanding 
total number of Dags oscillating on UI

Hello all, we recently ran into this issue in our setup. Briefly:

- deployment on k8s using the official helm chart, 4+ schedulers, 4K+ dags
- parsing takes approx. 25 minutes, with `min_file_process_interval` set to 50 minutes
- after moving to a standalone dag processor, dags started disappearing and reappearing in the UI

After reading this post, we looked into the interaction between the scheduler and the dag processor and found that the issue was:
- the processor re-parses each file every 50 minutes
- `dag_stale_not_seen_duration` is left at its default of 10 minutes
- the scheduler therefore deactivates every dag 10 minutes after the processor last parsed it (see the timing sketch below)
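
To make the arithmetic concrete, here is a rough back-of-the-envelope sketch in plain Python; the values are from our setup, not Airflow code:

```python
# Back-of-the-envelope timing sketch -- our setup's values, not Airflow internals.
from datetime import timedelta

min_file_process_interval = timedelta(minutes=50)    # how often a file gets re-parsed
dag_stale_not_seen_duration = timedelta(minutes=10)  # scheduler's staleness threshold (default)

# After a parse at t=0 the dag counts as "seen" until t=10min; it is then
# deactivated and only reappears at the next parse, roughly at t=50min.
hidden_for = min_file_process_interval - dag_stale_not_seen_duration
print(f"dag hidden for ~{hidden_for} out of every {min_file_process_interval} cycle")
# -> dag hidden for ~0:40:00 out of every 0:50:00 cycle
```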

Our solution for the moment is to set `dag_stale_not_seen_duration` equal to 
`min_file_process_interval` + 1h. 
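In `airflow.cfg` terms (Airflow 2.x, where both options live under `[scheduler]` and take seconds), the workaround looks roughly like this; the numbers are just ours:

```ini
[scheduler]
# re-parse each file at most every 50 minutes
min_file_process_interval = 3000
# only deactivate a dag when it has not been re-parsed for 110 minutes
# (min_file_process_interval + 1h), so a slow parsing loop cannot hide it
dag_stale_not_seen_duration = 6600
```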

However, we do have one question: the same configuration causes no issues when the schedulers themselves parse the dags. Is this a "bug"/unforeseen behaviour of the standalone setup, or is it considered standard behaviour that all users should be familiar with, i.e. working as expected?

To generalise the above: after looking into the source code and the logs more closely, we realised that both the scheduler and the dag processor are responsible for removing "stale" dags, in different ways. In my opinion this can cause confusion, because when dag parsing moves into a standalone component it would make sense to move everything concerning dags, serialised dags and their management into that component as well.

---

Tangential to the above, allow me to pose another question: to what extent, and under what circumstances, does running multiple dag processing managers result in duplicate work? I've seen many logs and indications that the processing managers work independently; however, it doesn't seem plausible that they never communicate through the database at all, so there has to be some interaction via the "last_parsed" attribute of the `serialized_dag` table.

My current understanding is:
- each processing manager builds a file queue using the last modified time (from the db) checked against `min_file_process_interval`
- until that file queue is cleared, the manager does not talk to the database again; it does not check the db on a per-dag or per-file basis
- at the end it bulk-updates the database with the parsing results
- this means that once a file has been picked up by one processing manager, it remains "open" to the others until that manager bulk-syncs (see the sketch below)

I am sure this is not exactly how the process works, but it should contain some elements of truth. I know that the processors do *some* duplicate work; what I am unsure about is whether they always do. I think this area is lacking in the docs and I would be open to helping clarify it, but I would first need to understand it completely myself!

GitHub link: 
https://github.com/apache/airflow/discussions/44495#discussioncomment-12229194
