GitHub user rtrindvg edited a discussion: Need help understanding the total
number of DAGs oscillating in the UI
I am in the middle of a migration from Airflow running in a virtual machine to
a Kubernetes cluster, currently in a staging environment. After a lot of
configuration adjustments to the Helm values.yaml, the cluster seems to be
stable and working fine.
But for some reason, the UI sometimes shows fewer DAGs than are actually
available. For example, we have a total of 93 DAGs. After the initial load,
which takes a couple of minutes, the count is stable for some time. Then it
drops to a smaller number (like 64), and after a couple of minutes it starts
to climb back up, eventually returning to 93. We confirmed this is not any
kind of browser cache. There were no pod restarts in the meantime, no changes
to the cluster, and no DAGs were changed either.
We are using git-sync with non-persistent storage, as recommended in the docs.
We enabled its debug logs and it seems to be working fine: it only downloads
changes when the DAGs branch is updated, and they seem to propagate quickly to
all relevant pods.
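As part of those checks, one quick way to watch git-sync is to tail the sidecar directly. A minimal sketch, assuming the Helm chart's default label (`component=scheduler`) and container name (`git-sync`), which may differ in your release:

```shell
# Tail the git-sync sidecar on the scheduler pod(s) to confirm it only
# fetches when the branch actually changes and is not repeatedly
# re-cloning the DAGs volume. The label selector and container name are
# the chart defaults and are assumptions here; adjust to your release.
kubectl logs -f -l component=scheduler -c git-sync --tail=100
```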
The scheduler logs showed no errors that could explain the drop in total DAGs,
except for the following line:
```
[2024-11-30T02:10:03.298+0000] {scheduler_job_runner.py:1782} INFO - Found (8)
stales dags not parsed after 2024-11-30 02:00:03.296796+00:00.
```
I am researching whether this is relevant to the issue at hand, but without
success so far.
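If it helps, this line appears to come from the scheduler's periodic stale-DAG cleanup, which deactivates DAGs whose files have not been re-parsed recently. A minimal sketch for inspecting the `[scheduler]` options that, as far as I can tell, govern parsing and that sweep in Airflow 2.9; the Deployment name is an assumption for illustration:

```shell
# Print the effective scheduler settings related to DAG parsing and
# stale-DAG deactivation. `deploy/airflow-scheduler` is an assumed name;
# substitute your release's scheduler Deployment or pod.
kubectl exec deploy/airflow-scheduler -- airflow config get-value scheduler parsing_cleanup_interval
kubectl exec deploy/airflow-scheduler -- airflow config get-value scheduler stale_dag_threshold
kubectl exec deploy/airflow-scheduler -- airflow config get-value scheduler min_file_process_interval
kubectl exec deploy/airflow-scheduler -- airflow config get-value scheduler dag_dir_list_interval
```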
Another fix we tried was enabling the non-default standalone DAG processor,
but the behavior is the same. I tried enabling the processor's verbose mode
via an environment variable, without success. Its logs are mostly blank, so I
have no clue whether the DAG processor is the culprit.
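For completeness, what we tried was roughly the following: raising the Airflow logging level on the processor via an environment variable (`AIRFLOW__LOGGING__LOGGING_LEVEL` maps to the `[logging] logging_level` option) and then verifying the pod picked it up. The Deployment name is an assumption:

```shell
# Raise verbosity for the standalone DAG processor by overriding the
# Airflow logging level, then confirm the running pod sees the new value.
# `deploy/airflow-dag-processor` is an assumed name; adjust to your release.
kubectl set env deploy/airflow-dag-processor AIRFLOW__LOGGING__LOGGING_LEVEL=DEBUG
kubectl exec deploy/airflow-dag-processor -- airflow config get-value logging logging_level
```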
We also replaced the CeleryExecutor with the KubernetesExecutor, since it is
better suited to our purposes. We did not think this was related to the issue
and, as expected, the behavior persists.
Since I am on the cloud-infra team and have no prior experience with Airflow,
can someone help me understand what the issue could be and what the next steps
might be in diagnosing our environment?
We are using Airflow 2.9.3 (the most recent version in the latest available
Helm chart) with Python 3.12, in a custom Dockerfile. We are not extending the
image, we are really customizing it, since we need to perform a couple of
compilations, and doing this before the Airflow pip installs makes rebuilds
faster and the final image smaller. I was not sure it was safe to point the
image at the latest Airflow release (I assumed an updated Helm chart would
have been published if it were), so we kept using this one.
Embedding the DAGs in the image is not an option, since they change
constantly, and rebuilding the image and redeploying the cluster several times
a day is not ideal for us. If updating the cluster to 2.10.3 is safe, or if
that version has known fixes for this behavior, please point me in the right
direction.
Thanks for any tips!
GitHub link: https://github.com/apache/airflow/discussions/44495