I've submitted a PR at https://github.com/apache/airflow/pull/19843.
The code was written quite a while ago. Airflow 2.2 has since been released, but I've not noticed any regressions on rebasing over it. Would appreciate any reviews. Thanks!

On Sat, 28 Aug 2021 at 23:19, Chris Fei <[email protected]> wrote:

> I do something similar to what you describe, where I can replace what's
> displayed in the "Code" view based on some other content. I was able to
> cobble it together by using airflow_local_settings.py to patch the function
> that's invoked when the "Code" view is accessed. It's a total hack, but it
> works for my use-case (I'm on airflow 2.1.3). YMMV.
>
> I have a python file in the dags folder that parses YAML files from a
> directory and publishes DAGs to the global scope based on the content of
> each YAML file (based on your previous messages, you probably have
> something similar). When creating each DAG, I add a KV pair in default_args
> for the file path to the underlying YAML. To access the YAML file in the
> Code view on the webserver, I set up airflow_local_settings.py to monkey
> patch the airflow.www.views.Airflow.code function with my own version. The
> patched function sets up a DagBag to allow me to access the full DAG
> object, where I can then read the YAML file from default_args and then
> return its content in the output. In my case, the scheduler and webserver
> run on the same VM so I can pass just the file path between them. If you
> can't just pass around the file path, I suppose you could try tossing the
> entire YAML content as a default_arg as well.
>
> This approach doesn't change fileloc and doesn't output the YAML file
> content to the dag_code table.
>
> One obvious downside to this approach is that whenever I upgrade airflow I
> have to double check in staging that my hack still works. But the function
> I patch is super simple and my change is like 4 extra lines of code, so I
> don't mind it.
>
> Chris
>
> On Sat, Aug 21, 2021, at 3:03 PM, Siddharth VP wrote:
>
> Ok.
> Before I give up on the idea, how so? To me it looks like a new
> feature entirely, so not sure what you mean by "changing fileloc like that".
>
> On Sat, 21 Aug 2021 at 22:28, Ash Berlin-Taylor <[email protected]> wrote:
>
> FYI: Changing fileloc like that will have unintended consequences for
> execution in future versions (i.e. 2.2) so we can't do that.
>
> -ash
>
> On Sat, Aug 21 2021 at 09:50:57 +0530, Siddharth VP <[email protected]>
> wrote:
>
> Yes, per my commit linked above, it would be the yaml file that is shown in
> the webserver code view (because the fileloc field points to that).
>
> I've been doing something similar to what Damian said, but with the
> difference that I have to generate the YAMLs programmatically based on
> parameters received at an API endpoint - and implement all the CRUD
> operations on these yaml config files. During POC testing, I saw that when
> there were 200+ unpaused DAGs generated this way, the dag processor timeout
> was being hit. On increasing that timeout to a higher value, I got a
> message on the UI that the last scheduler heartbeat was 1 minute ago (which
> is likely because the scheduler was busy with DAG processing for the whole
> minute).
>
> That's what brings me to this proposal. By embedding the task of parsing
> the yaml/json within Airflow, dynamic dags are supported in a much more
> "native" way, such that all timeouts and intervals apply individually to
> the config files.
>
> This won't replace dag-factory <https://github.com/ajbosco/dag-factory> and
> other ecosystem tools (because we still need to have the "builder" python
> code), but rather improve the scalability when using such tools, and avoid
> compromising on scheduler availability (though I may have had this issue
> only because I was using 1.10.12).
>
> Would love to hear feedback on whether the patch is PR-worthy -- because I
> think it is quite simple (doesn't require any schema changes for instance)
> but still addresses a lot of dynamic workflow needs.
>
> On Sat, 21 Aug 2021 at 03:29, Jarek Potiuk <[email protected]> wrote:
>
> Agree with Ash here. It's OK to present a different view of the "source" of
> the DAG once we parsed the Python code. This can be done and it could be as
> easy as
>
> a) adding a field to dag to point to a "definition file" if the DAGs are
> produced by parsing files from source folder
> b) API call/parameter to submit/fetch the dag (providing we implement some
> form of DAG fetcher/DAG submission)
>
> On Fri, Aug 20, 2021 at 10:49 PM Ash Berlin-Taylor <[email protected]> wrote:
>
> Changing the code view to show the YAML is now "relatively" easy to
> achieve, at least from the webserver point of view, as since 2.0 it doesn't
> read the files on disk, but from the DB.
>
> There's a lot of details, but changing the way these DagCode rows are
> written could be achievable whilst still keeping the "there must be a
> python file to generate the dag" requirement.
>
> -ash
>
> On Fri, Aug 20 2021 at 20:41:55 +0000, "Shaw, Damian P." <
> [email protected]> wrote:
>
> I’d personally find this very useful. There’s usually extra information I
> have about the DAG, and the current “docs_md” is usually not nearly
> sufficient as it’s poorly placed so if I start adding a lot of info
> it gets in the way of the regular UI. Also last I tested the markdown
> formatting didn’t work and neither did the other formatter options.
>
> But I’m not sure how much other people have demand for this.
>
> Thanks,
> Damian
>
> *From:* Collin McNulty <[email protected]>
> *Sent:* Friday, August 20, 2021 16:36
> *To:* [email protected]
> *Subject:* Re: [DISCUSS] Adding better support for parametrized DAGs and
> dynamic DAGs using JSON/YAML dataformats
>
> On the topic of pointing the code view to yaml, would we alternatively
> consider adding a view on the UI that would allow arbitrary text content?
> This could be accomplished by adding an optional parameter to the dag
> object that allowed you to pass text (or a filepath) that would then go
> through a renderer (e.g. markdown). It could be a readme, or yaml content
> or anything the author wanted.
>
> Collin
>
> On Fri, Aug 20, 2021 at 3:27 PM Shaw, Damian P. <
> [email protected]> wrote:
>
> FYI this is what I did on one of my past projects for Airflow.
>
> The users wanted to write their DAGs as YAML files so my “DAG file” was a
> Python script that read the YAML files and converted them to DAGs. It was
> very easy to do and worked because of the flexibility of Airflow.
>
> The one thing that would have been nice though is if I could have easily
> changed the “code view” in Airflow to point to the relevant YAML file
> instead of the less useful “DAG file”.
>
> Damian
>
> *From:* Jarek Potiuk <[email protected]>
> *Sent:* Friday, August 20, 2021 16:21
> *To:* [email protected]
> *Cc:* [email protected]
> *Subject:* Re: [DISCUSS] Adding better support for parametrized DAGs and
> dynamic DAGs using JSON/YAML dataformats
>
> Airflow DAGs are Python code. This is a very basic assumption - which is
> not likely to change. Ever.
>
> And we are working on making it even more powerful. Writing DAGs in
> yaml/json makes them less powerful and less flexible. This is fine if you
> want to build on top of airflow and build a more declarative way of
> defining dags and use airflow to run it under the hood.
>
> If you think there is a group of users who can benefit from that - cool.
> You can publish code to convert those to Airflow DAGs and submit it to
> our Ecosystem page. There are plenty of tools like "CWL - Common Workflow
> Language" and others:
>
> https://airflow.apache.org/ecosystem/#tools-integrating-with-airflow
>
> J.
>
> On Fri, Aug 20, 2021 at 2:48 PM Siddharth VP <[email protected]>
> wrote:
>
> Have we considered allowing dags in json/yaml formats before? I came up
> with a rather straightforward way to address parametrized and dynamic DAGs
> in Airflow, which I think makes dynamic dags work at scale.
>
> *Background / Current limitations:*
>
> 1. Dynamic DAG generation using single-file methods
> <https://www.astronomer.io/guides/dynamically-generating-dags#single-file-methods>
> can cause scalability issues
> <https://www.astronomer.io/guides/dynamically-generating-dags#scalability>
> where there are too many active DAGs per file. The
> dag_file_processor_timeout is applied to the loader file, so *all* dynamically
> generated dags need to be processed in that time. Sure, the timeout could be
> increased, but that may be undesirable (what if there are other static DAGs
> in the system on which we really want to enforce a small timeout?)
>
> 2. Parametrizing DAGs in Airflow is difficult. There is no good way to
> have multiple workflows that differ only by choices of some constants.
> Using TriggerDagRunOperator to trigger a generic DAG with conf doesn't give
> a native-ish experience as it creates DagRuns of the *triggered* dag
> rather than *this* dag - which also means a single scheduler log file.
>
> *Suggested approach:*
>
> 1. User writes configuration files in JSON/YAML format. The schema can be
> arbitrary except for one condition that it must have a *builder* parameter
> with the path to a python file.
>
> 2. User writes the "builder" - a python file containing a make_dag method
> that receives the parsed json/yaml and returns a DAG object.
> (Just a sample strategy, we could instead say the file should contain a
> class that extends an abstract DagBuilder class.)
>
> 3. Airflow reads JSON/YAML files as well from the dags directory. It
> parses the file, imports the builder python file, and passes the parsed
> json/yaml to it and collects the generated DAG into the DagBag.
>
> *Sample implementation:*
>
> See
> https://github.com/siddharthvp/airflow/commit/47bad51fc4999737e9a300b134c04bbdbd04c88a;
> only major code change is in dagbag.py
>
> *Result:*
>
> Dag file processor logs show yaml/json file (instead of the builder python
> file). Each dynamically generated dag gets its own scheduler log file.
>
> The configs dag_dir_list_interval, min_file_process_interval,
> file_parsing_sort_mode all directly apply to dag config files.
>
> If the json/yaml fail to parse, it's registered as an import error.
>
> Would like to know your thoughts on this. Thanks!
>
> Siddharth VP
>
> --
>
> +48 660 796 129
>
> ==============================================================================
> Please access the attached hyperlink for an important electronic
> communications disclaimer:
> http://www.credit-suisse.com/legal/en/disclaimer_email_ib.html
> ==============================================================================
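
For readers skimming the thread, the three steps of the suggested approach quoted above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the linked commit or PR: it uses JSON and the standard library so that it runs without Airflow installed, the builder returns a plain dict as a stand-in for a real DAG object, and the helper names (load_builder, build_from_config, collect, demo) as well as the choice to resolve a relative builder path against the config file's directory are hypothetical.

```python
import importlib.util
import json
import tempfile
from pathlib import Path


def load_builder(builder_path):
    """Import the builder file and return its make_dag callable."""
    spec = importlib.util.spec_from_file_location("dag_builder", builder_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.make_dag


def build_from_config(config_path):
    """Parse one config file and pass the parsed content to the builder it names."""
    config_path = Path(config_path)
    config = json.loads(config_path.read_text())
    # The one schema rule from the proposal: a "builder" key holding a path
    # to a Python file. Resolving it relative to the config is an assumption.
    builder_path = Path(config["builder"])
    if not builder_path.is_absolute():
        builder_path = config_path.parent / builder_path
    return load_builder(builder_path)(config)


def collect(dags_dir):
    """Process each config file on its own; return (dags, import_errors)."""
    dags, import_errors = {}, {}
    for config_path in sorted(Path(dags_dir).glob("*.json")):
        try:
            dag = build_from_config(config_path)
            dags[dag["dag_id"]] = dag
        except Exception as exc:
            # A config that fails to parse or build is recorded, not fatal.
            import_errors[config_path.name] = repr(exc)
    return dags, import_errors


def demo():
    """Write a builder, one valid config and one broken config, then collect."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        (tmp / "builder.py").write_text(
            "def make_dag(config):\n"
            "    # Stand-in: a real builder would return an Airflow DAG object.\n"
            "    return {'dag_id': config['dag_id'],\n"
            "            'schedule': config.get('schedule')}\n"
        )
        (tmp / "sales.json").write_text(
            json.dumps({"builder": "builder.py", "dag_id": "sales", "schedule": "@daily"})
        )
        (tmp / "broken.json").write_text("{not valid json")
        return collect(tmp)
```

Calling demo() returns one stand-in DAG for sales.json and one recorded error for broken.json, which mirrors the point made in the thread that each config file is handled individually and a file that fails to parse is registered as an import error instead of taking down the other DAGs.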
