Ok. Before I give up on the idea, how so? To me it looks like a new feature entirely, so I'm not sure what you mean by "changing fileloc like that".
On Sat, 21 Aug 2021 at 22:28, Ash Berlin-Taylor <[email protected]> wrote:

> FYI: Changing fileloc like that will have unintended consequences for execution in future versions (i.e. 2.2), so we can't do that.
>
> -ash
>
> On Sat, Aug 21 2021 at 09:50:57 +0530, Siddharth VP <[email protected]> wrote:
>
> Yes, per my commit linked above, it would be the YAML file that is shown in the webserver code view (because the fileloc field points to that).
>
> I've been doing something similar to what Damian said, but with the difference that I have to generate the YAMLs programmatically based on parameters received at an API endpoint - and implement all the CRUD operations on these YAML config files. During POC testing, I saw that when there were 200+ unpaused DAGs generated this way, the dag processor timeout was being hit. On increasing that timeout to a higher value, I got a message on the UI that the last scheduler heartbeat was 1 minute ago (which is likely because the scheduler was busy with DAG processing for the whole minute).
>
> That's what brings me to this proposal. By embedding the task of parsing the YAML/JSON within Airflow, dynamic DAGs are supported in a much more "native" way, such that all timeouts and intervals apply individually to the config files.
>
> This won't replace dag-factory <https://github.com/ajbosco/dag-factory> and other ecosystem tools (because we still need to have the "builder" Python code); rather, it improves scalability when using such tools and avoids compromising scheduler availability (though I may have had this issue only because I was using 1.10.12).
>
> Would love to hear feedback on whether the patch is PR-worthy -- because I think it is quite simple (it doesn't require any schema changes, for instance) but still addresses a lot of dynamic workflow needs.
>
> On Sat, 21 Aug 2021 at 03:29, Jarek Potiuk <[email protected]> wrote:
>
>> Agree with Ash here.
>> It's OK to present a different view of the "source" of the DAG once we've parsed the Python code. This can be done, and it could be as easy as:
>>
>> a) adding a field to the DAG to point to a "definition file" if the DAGs are produced by parsing files from the source folder
>> b) an API call/parameter to submit/fetch the DAG (providing we implement some form of DAG fetcher/DAG submission)
>>
>> On Fri, Aug 20, 2021 at 10:49 PM Ash Berlin-Taylor <[email protected]> wrote:
>>
>>> Changing the code view to show the YAML is now "relatively" easy to achieve, at least from the webserver point of view, since as of 2.0 it doesn't read the files from disk, but from the DB.
>>>
>>> There are a lot of details, but changing the way these DagCode rows are written could be achievable whilst still keeping the "there must be a Python file to generate the DAG" rule.
>>>
>>> -ash
>>>
>>> On Fri, Aug 20 2021 at 20:41:55 +0000, "Shaw, Damian P." <[email protected]> wrote:
>>>
>>> I'd personally find this very useful. There's usually extra information I have about the DAG, and the current "doc_md" is usually not sufficient, as it's poorly placed, so if I start adding a lot of info it gets in the way of the regular UI. Also, last I tested, the markdown formatting didn't work and neither did the other formatter options.
>>>
>>> But I'm not sure how much demand other people have for this.
>>>
>>> Thanks,
>>>
>>> Damian
>>>
>>> *From:* Collin McNulty <[email protected]>
>>> *Sent:* Friday, August 20, 2021 16:36
>>> *To:* [email protected]
>>> *Subject:* Re: [DISCUSS] Adding better support for parametrized DAGs and dynamic DAGs using JSON/YAML dataformats
>>>
>>> On the topic of pointing the code view to YAML, would we alternatively consider adding a view on the UI that would allow arbitrary text content?
>>> This could be accomplished by adding an optional parameter to the DAG object that allowed you to pass text (or a filepath) that would then go through a renderer (e.g. markdown). It could be a readme, or YAML content, or anything the author wanted.
>>>
>>> Collin
>>>
>>> On Fri, Aug 20, 2021 at 3:27 PM Shaw, Damian P. <[email protected]> wrote:
>>>
>>> FYI, this is what I did on one of my past projects for Airflow.
>>>
>>> The users wanted to write their DAGs as YAML files, so my "DAG file" was a Python script that read the YAML files and converted them to DAGs. It was very easy to do and worked because of the flexibility of Airflow.
>>>
>>> The one thing that would have been nice, though, is if I could have easily changed the "code view" in Airflow to point to the relevant YAML file instead of the less useful "DAG file".
>>>
>>> Damian
>>>
>>> *From:* Jarek Potiuk <[email protected]>
>>> *Sent:* Friday, August 20, 2021 16:21
>>> *To:* [email protected]
>>> *Cc:* [email protected]
>>> *Subject:* Re: [DISCUSS] Adding better support for parametrized DAGs and dynamic DAGs using JSON/YAML dataformats
>>>
>>> Airflow DAGs are Python code. This is a very basic assumption - one that is not likely to change. Ever.
>>>
>>> And we are working on making it even more powerful. Writing DAGs in YAML/JSON makes them less powerful and less flexible. This is fine if you want to build on top of Airflow and create a more declarative way of defining DAGs, using Airflow to run them under the hood.
>>>
>>> If you think there is a group of users who can benefit from that - cool. You can publish code to convert those to Airflow DAGs and submit it to our Ecosystem page. There are plenty of tools like "CWL - Common Workflow Language" and others:
>>>
>>> https://airflow.apache.org/ecosystem/#tools-integrating-with-airflow
>>>
>>> J.
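[Editor's note: a minimal, runnable sketch of the single-file pattern Damian describes above - one "DAG file" that reads declarative config files and turns each into a DAG. All filenames, the config schema, and the `build_dag` helper are assumptions for illustration; JSON stands in for YAML so the sketch needs only the standard library, and a plain dict stands in for `airflow.DAG` so it is self-contained.]

```python
# Sketch: a single "DAG file" that discovers config files and generates
# one DAG per config. In a real deployment build_dag() would construct
# airflow.DAG and its operators; a dict stands in here.
import glob
import json
import os

# Hypothetical location for the declarative configs.
CONFIG_DIR = os.environ.get("DAG_CONFIG_DIR", "/opt/airflow/dags/configs")


def build_dag(conf):
    """Turn one parsed config dict into a DAG-like object."""
    return {
        "dag_id": conf["dag_id"],
        "schedule": conf.get("schedule", "@daily"),
        "tasks": [t["task_id"] for t in conf.get("tasks", [])],
    }


# Each generated DAG must land in this module's global namespace so that
# Airflow's DagBag discovers it when it imports this file. Note that the
# dag_file_processor_timeout applies to this ONE file, covering ALL
# generated DAGs - the scalability issue discussed in this thread.
for path in glob.glob(os.path.join(CONFIG_DIR, "*.json")):
    with open(path) as f:
        conf = json.load(f)
    globals()[conf["dag_id"]] = build_dag(conf)
```

With PyYAML installed, swapping `json.load(f)` for `yaml.safe_load(f)` and the glob pattern for `*.yaml` gives the YAML variant described in the thread.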
>>> On Fri, Aug 20, 2021 at 2:48 PM Siddharth VP <[email protected]> wrote:
>>>
>>> Have we considered allowing DAGs in JSON/YAML formats before? I came up with a rather straightforward way to address parametrized and dynamic DAGs in Airflow, which I think makes dynamic DAGs work at scale.
>>>
>>> *Background / Current limitations:*
>>>
>>> 1. Dynamic DAG generation using single-file methods <https://www.astronomer.io/guides/dynamically-generating-dags#single-file-methods> can cause scalability issues <https://www.astronomer.io/guides/dynamically-generating-dags#scalability> when there are too many active DAGs per file. The dag_file_processor_timeout is applied to the loader file, so *all* dynamically generated DAGs need to be processed in that time. Sure, the timeout could be increased, but that may be undesirable (what if there are other static DAGs in the system on which we really want to enforce a small timeout?).
>>>
>>> 2. Parametrizing DAGs in Airflow is difficult. There is no good way to have multiple workflows that differ only by choices of some constants. Using TriggerDagRunOperator to trigger a generic DAG with conf doesn't give a native-ish experience, as it creates DagRuns of the *triggered* DAG rather than *this* DAG - which also means a single scheduler log file.
>>>
>>> *Suggested approach:*
>>>
>>> 1. The user writes configuration files in JSON/YAML format. The schema can be arbitrary except for one condition: it must have a *builder* parameter with the path to a Python file.
>>>
>>> 2. The user writes the "builder" - a Python file containing a make_dag method that receives the parsed JSON/YAML and returns a DAG object. (Just a sample strategy; we could instead say the file should contain a class that extends an abstract DagBuilder class.)
>>>
>>> 3. Airflow reads JSON/YAML files as well from the dags directory.
>>> It parses the file, imports the builder Python file, passes the parsed JSON/YAML to it, and collects the generated DAG into the DagBag.
>>>
>>> *Sample implementation:*
>>>
>>> See https://github.com/siddharthvp/airflow/commit/47bad51fc4999737e9a300b134c04bbdbd04c88a; the only major code change is in dagbag.py.
>>>
>>> *Result:*
>>>
>>> The dag file processor logs show the YAML/JSON file (instead of the builder Python file). Each dynamically generated DAG gets its own scheduler log file.
>>>
>>> The configs dag_dir_list_interval, min_file_process_interval, and file_parsing_sort_mode all directly apply to DAG config files.
>>>
>>> If the JSON/YAML fails to parse, it's registered as an import error.
>>>
>>> Would like to know your thoughts on this. Thanks!
>>>
>>> Siddharth VP
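[Editor's note: a rough, runnable sketch of the contract the proposal above describes - a config file carries a *builder* key, and the loader imports that file and calls its make_dag(conf). The make_dag name and builder key follow the email; everything else (the loader function name, use of JSON rather than YAML, the lack of error handling) is an assumption, not the actual code from the linked dagbag.py commit.]

```python
# Sketch: load one declarative DAG config, import its "builder" module,
# and return the DAG produced by the builder's make_dag(conf).
import importlib.util
import json


def load_dag_from_config(config_path):
    """Parse a config file and hand the parsed dict to its builder."""
    with open(config_path) as f:
        conf = json.load(f)  # a YAML variant would use yaml.safe_load

    # Per the proposal, the config must name the builder Python file.
    builder_path = conf["builder"]

    # Import the builder from its file path.
    spec = importlib.util.spec_from_file_location("dag_builder", builder_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)

    # The builder receives the full parsed config and returns a DAG object,
    # which the caller (dagbag.py, in the sample implementation) would
    # collect into the DagBag. A parse failure here would surface as an
    # import error, as described in the Result section above.
    return module.make_dag(conf)
```

Because each config file is loaded independently, per-file settings such as dag_file_processor_timeout and min_file_process_interval would apply to one config at a time rather than to a single monolithic loader file.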
