Hi there,

We’re using Airflow in our startup and it’s been great in many ways, thanks for 
the work you guys are doing!

Unfortunately, we’re hitting a bunch of issues: ops time out and DAGs fail
for unclear reasons, either with no logs at all or with the following error:
"airflow.exceptions.AirflowException: dag_id could not be found". This mostly
happens when enough DAGs are running at the same time, though it also happens
occasionally under lighter load. The most reliable way to reproduce the error
with our setup is to run enough DAGs at once. Most of the time, clearing the
failed DAG run or ops and letting the DAG re-run is enough to fix the problem.

I found resources pointing to the dagbag_import_timeout setting, e.g.,
https://stackoverflow.com/questions/43235130/airflow-dag-id-could-not-be-found
I did play with that parameter, and with other parameters as well. They do seem
to help, i.e., I can run more DAGs at once, but
        (1) if I run enough DAGs at once, I still see ops and DAGs failing, so
the problem is not fixed;
        (2) more importantly, I don’t fully understand the problem. I have some 
ideas on what is happening, but maybe I’m totally wrong?
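
Since dagbag_import_timeout limits how long Airflow waits for a DAG file to
import, one way to check whether slow parsing is involved is to time the import
of each file in the DAG folder. The sketch below is only a rough stand-in: it
uses plain importlib rather than Airflow's own loader, the function name
time_dag_file_imports is hypothetical, and it assumes it is run in the same
Python environment (and under similar load) as the workers.

```python
import importlib.util
import time
from pathlib import Path


def time_dag_file_imports(dag_folder):
    """Import every .py file under dag_folder and time it.

    Returns a list of (path, seconds, error) tuples, slowest first.
    This approximates DAG parsing cost; it is not Airflow's actual loader.
    """
    results = []
    for index, path in enumerate(sorted(Path(dag_folder).rglob("*.py"))):
        # Hypothetical unique module name so files do not shadow each other.
        module_name = f"_dag_timing_{index}"
        spec = importlib.util.spec_from_file_location(module_name, path)
        module = importlib.util.module_from_spec(spec)
        error = None
        start = time.perf_counter()
        try:
            # Executes the file's top-level code, as an import would.
            spec.loader.exec_module(module)
        except Exception as exc:
            # Record the failure instead of aborting the whole scan.
            error = repr(exc)
        elapsed = time.perf_counter() - start
        results.append((str(path), elapsed, error))
    # Slowest files first: these are the likeliest to hit an import timeout.
    return sorted(results, key=lambda item: item[1], reverse=True)
```

Calling time_dag_file_imports on the DAG folder while many DAGs are running,
and comparing the slowest entries with the configured dagbag_import_timeout,
would show whether imports get close to the limit under load.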

Any recommendations on how I should investigate that?

Thank you very much!
Have a nice rest of the day,
Stéphane
http://stephanebonneaud.com