I've successfully been using Airflow the past few months to run relatively simple dags on a regular cadence. However, I've run into stability issues with more complex workflows that consist of parent/child dags, some of which may contain hundreds of tasks. The most common symptom is that state transitions do not always happen even though the previous task succeeds, requiring human monitoring/prodding when it gets stuck.
I would appreciate any advice on the things to look at to debug this or common causes for this. Thank you, Ken
