Have we considered allowing DAGs defined in JSON/YAML formats before? I came up with a rather straightforward way to address parametrized and dynamic DAGs in Airflow, which I think makes dynamic DAGs work at scale.
*Background / Current limitations:*

1. Dynamic DAG generation using single-file methods <https://www.astronomer.io/guides/dynamically-generating-dags#single-file-methods> can cause scalability issues <https://www.astronomer.io/guides/dynamically-generating-dags#scalability> when there are too many active DAGs per file. The dag_file_processor_timeout is applied to the loader file, so *all* dynamically generated DAGs need to be processed within that time. Sure, the timeout could be increased, but that may be undesirable - what if there are other static DAGs in the system on which we really do want to enforce a small timeout?

2. Parametrizing DAGs in Airflow is difficult. There is no good way to have multiple workflows that differ only in the choice of a few constants. Using TriggerDagRunOperator to trigger a generic DAG with a conf doesn't give a native experience, as it creates DagRuns of the *triggered* DAG rather than *this* DAG - which also means a single scheduler log file for all of them.

*Suggested approach:*

1. The user writes configuration files in JSON/YAML format. The schema can be arbitrary except for one condition: it must have a *builder* parameter with the path to a Python file.

2. The user writes the "builder" - a Python file containing a make_dag method that receives the parsed JSON/YAML and returns a DAG object. (This is just a sample strategy; we could instead say the file should contain a class that extends an abstract DagBuilder class.)

3. Airflow reads JSON/YAML files from the dags directory as well. It parses the file, imports the builder Python file, passes the parsed JSON/YAML to it, and collects the generated DAG into the DagBag. (A rough sketch of what a config file and builder could look like is at the end of this mail.)

*Sample implementation:* See https://github.com/siddharthvp/airflow/commit/47bad51fc4999737e9a300b134c04bbdbd04c88a; the only major code change is in dagbag.py.

*Result:* DAG file processor logs show the YAML/JSON file (instead of the builder Python file). Each dynamically generated DAG gets its own scheduler log file. The configs dag_dir_list_interval, min_file_process_interval and file_parsing_sort_mode all apply directly to the DAG config files. If the JSON/YAML fails to parse, it is registered as an import error.

Would like to know your thoughts on this.

Thanks!
Siddharth VP
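P.S. To make this concrete, here is a rough sketch of what a config file and its builder could look like under this proposal. The make_dag entry point is the one from the suggested approach above; the file names and the schema fields (name, schedule, tables) are made up purely for illustration.

```
# dags/ingest_sales.yaml - a hypothetical config file dropped into the dags directory
builder: /opt/airflow/dags/builders/ingestion_builder.py
name: ingest_sales
schedule: "@daily"
tables:
  - sales
  - refunds
```

```
# dags/builders/ingestion_builder.py - a hypothetical builder file
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator


def make_dag(config: dict) -> DAG:
    """Receive the parsed YAML/JSON as a dict and return a DAG object."""
    with DAG(
        dag_id=config["name"],
        schedule_interval=config["schedule"],
        start_date=datetime(2022, 1, 1),
        catchup=False,
    ) as dag:
        for table in config["tables"]:
            BashOperator(
                task_id=f"load_{table}",
                bash_command=f"echo loading {table}",
            )
    return dag
```

Each config file then shows up as its own DAG, with its own scheduler log file, while the heavy lifting stays in one shared builder.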
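For those who don't want to open the commit, the DagBag-side change conceptually boils down to something like the snippet below. This is a paraphrase rather than the actual diff; load_config_dag and the exact hook point inside dagbag.py are placeholders.

```
# Conceptual sketch only - see the linked commit for the real change in dagbag.py
import importlib.util

import yaml


def load_config_dag(config_path: str):
    """Parse a YAML/JSON config, import its builder file, and return the built DAG."""
    with open(config_path) as f:
        config = yaml.safe_load(f)  # a parse failure here is registered as an import error

    # Import the builder module referenced by the mandatory `builder` parameter
    spec = importlib.util.spec_from_file_location("dag_builder", config["builder"])
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)

    # Hand the parsed config to the builder and collect the resulting DAG into the DagBag
    return module.make_dag(config)
```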
