Yes per my commit linked above, it would be the yaml file that is shown in
the webserver code view (because the fileloc field points to that).

I've been doing something similar to what Damian said, but with the
difference that I have to generate the YAMLs programmatically based on
parameters received at an API endpoint - and implement all the CRUD
operations on these YAML config files. During POC testing, I saw that when
there were 200+ unpaused DAGs generated this way, the dag processor timeout
was being hit. On increasing that timeout to a higher value, I got a
message on the UI that the last scheduler heartbeat was 1 minute ago (which
is likely because the scheduler was busy with DAG processing for the whole
minute).

That's what brings me to this proposal. By embedding the task of parsing
the yaml/json within Airflow, dynamic dags are supported in a much more
"native" way, such that all timeouts and intervals apply individually to
the config files.
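
Concretely, the parsing-related settings I'd expect to apply per config
file are the following (a sketch of a 2.x airflow.cfg; section placement
and defaults may vary by Airflow version):

```ini
[core]
# Hard limit on parsing a single file. With this proposal it would apply
# to each YAML/JSON config file individually, instead of to one big
# loader script that generates hundreds of DAGs.
dag_file_processor_timeout = 50

[scheduler]
# How often the dags folder is rescanned for new files.
dag_dir_list_interval = 300
# Minimum number of seconds between re-parses of the same file.
min_file_process_interval = 30
```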

This won't replace dag-factory <https://github.com/ajbosco/dag-factory> and
other ecosystem tools (because we still need to have the "builder" python
code); rather, it improves the scalability of using such tools and avoids
compromising scheduler availability (though I may have had this issue
only because I was using 1.10.12).

Would love to hear feedback on whether the patch is PR-worthy -- because I
think it is quite simple (doesn't require any schema changes for instance)
but still addresses a lot of dynamic workflow needs.

On Sat, 21 Aug 2021 at 03:29, Jarek Potiuk <[email protected]> wrote:

> Agree with Ash here. It's OK to present different view of the "source" of
> the DAG once we parsed the Python code. This can be done and it could be as
> easy as
>
> a) adding a field to dag to point to a "definition file" if the DAGs are
> produced by parsing files from source folder
> b) API call/parameter to submit/fetch the dag (providing we implement some
> form of DAG fetcher/DAG submission)
>
> On Fri, Aug 20, 2021 at 10:49 PM Ash Berlin-Taylor <[email protected]> wrote:
>
>> Changing the code view to show the YAML is now "relatively" easy to
>> achieve, at least from the webserver point of view, since as of 2.0 it
>> reads the code not from the files on disk, but from the DB.
>>
>> There are a lot of details, but changing the way these DagCode rows are
>> written could be achievable whilst still keeping the "there must be a
>> python file to generate the dag" rule.
>>
>> -ash
>>
>> On Fri, Aug 20 2021 at 20:41:55 +0000, "Shaw, Damian P." <
>> [email protected]> wrote:
>>
>> I’d personally find this very useful. There’s usually extra information I
>> have about the DAG, and the current “docs_md” is usually not nearly
>> sufficient, as it’s poorly placed: if I start adding a lot of info it
>> gets in the way of the regular UI. Also, last I tested, the markdown
>> formatting didn’t work and neither did the other formatter options.
>>
>>
>>
>> But I’m not sure how much other people have demand for this.
>>
>>
>>
>> Thanks,
>>
>> Damian
>>
>>
>>
>> *From:* Collin McNulty <[email protected]>
>> *Sent:* Friday, August 20, 2021 16:36
>> *To:* [email protected]
>> *Subject:* Re: [DISCUSS] Adding better support for parametrized DAGs and
>> dynamic DAGs using JSON/YAML dataformats
>>
>>
>>
>> On the topic of pointing the code view to yaml, would we alternatively
>> consider adding a view on the UI that would allow arbitrary text content?
>> This could be accomplished by adding an optional parameter to the dag
>> object that allowed you to pass text (or a filepath) that would then go
>> through a renderer (e.g. markdown). It could be a readme, or yaml content
>> or anything the author wanted.
>>
>>
>>
>> Collin
>>
>>
>>
>> On Fri, Aug 20, 2021 at 3:27 PM Shaw, Damian P. <
>> [email protected]> wrote:
>>
>> FYI this is what I did on one of my past projects for Airflow.
>>
>>
>>
>> The users wanted to write their DAGs as YAML files so my “DAG file” was a
>> Python script that read the YAML files and converted them to DAGs. It was
>> very easy to do and worked because of the flexibility of Airflow.
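
For anyone curious, that loader pattern looks roughly like this (a minimal
sketch: `SimpleDag`, the config schema, and the directory path are
illustrative stand-ins so the snippet runs without an Airflow install; a
real DAG file would build `airflow.DAG` objects and register them in
`globals()` so the DagBag picks them up):

```python
import json
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDag:
    # Stand-in for airflow.DAG so the sketch runs standalone.
    dag_id: str
    schedule: str
    tasks: list = field(default_factory=list)


def build_dag(config: dict) -> SimpleDag:
    """Convert one parsed config dict into a DAG object."""
    dag = SimpleDag(dag_id=config["dag_id"],
                    schedule=config.get("schedule", "@daily"))
    for task in config.get("tasks", []):
        dag.tasks.append(task["task_id"])
    return dag


def load_dags(config_dir: Path) -> dict:
    """One loader file parses *every* config. The whole loop must finish
    within dag_file_processor_timeout -- the scalability pinch point
    discussed later in this thread."""
    dags = {}
    for path in sorted(config_dir.glob("*.json")):
        config = json.loads(path.read_text())
        dag = build_dag(config)
        dags[dag.dag_id] = dag
    return dags


# In a real Airflow DAG file you would then do something like:
#   globals().update(load_dags(Path("/opt/airflow/configs")))
```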
>>
>>
>>
>> The one thing that would have been nice though is if I could have easily
>> changed the “code view” in Airflow to point to the relevant YAML file
>> instead of the less useful “DAG file”.
>>
>>
>>
>> Damian
>>
>>
>>
>> *From:* Jarek Potiuk <[email protected]>
>> *Sent:* Friday, August 20, 2021 16:21
>> *To:* [email protected]
>> *Cc:* [email protected]
>> *Subject:* Re: [DISCUSS] Adding better support for parametrized DAGs and
>> dynamic DAGs using JSON/YAML dataformats
>>
>>
>>
>> Airflow DAGs are Python code. This is a very basic assumption - which is
>> not likely to change. Ever.
>>
>>
>>
>> And we are working on making it even more powerful. Writing DAGs in
>> yaml/json makes them less powerful and less flexible. This is fine if you
>> want to build on top of airflow and build a more declarative way of
>> defining dags and use airflow to run it under the hood.
>>
>> if you think there is a group of users who can benefit from that - cool.
>> You can publish the code to convert those to Airflow DAGs and submit it to
>> our Ecosystem page. There are plenty of tools like "CWL - Common Workflow
>> Language" and others:
>>
>> https://airflow.apache.org/ecosystem/#tools-integrating-with-airflow
>>
>>
>>
>> J.
>>
>>
>>
>> On Fri, Aug 20, 2021 at 2:48 PM Siddharth VP <[email protected]>
>> wrote:
>>
>> Have we considered allowing dags in json/yaml formats before? I came up
>> with a rather straightforward way to address parametrized and dynamic DAGs
>> in Airflow, which I think makes dynamic dags work at scale.
>>
>>
>>
>> *Background / Current limitations:*
>>
>> 1. Dynamic DAG generation using single-file methods
>> <https://www.astronomer.io/guides/dynamically-generating-dags#single-file-methods>
>>  can
>> cause scalability issues
>> <https://www.astronomer.io/guides/dynamically-generating-dags#scalability>
>> where there are too many active DAGs per file. The
>> dag_file_processor_timeout is applied to the loader file, so *all*
>> dynamically generated dags need to be processed within that time. Sure, the
>> timeout could be increased, but that may be undesirable (what if there are
>> other static DAGs in the system on which we really want to enforce a small
>> timeout?)
>>
>> 2. Parametrizing DAGs in Airflow is difficult. There is no good way to
>> have multiple workflows that differ only by choices of some constants.
>> Using TriggerDagRunOperator to trigger a generic DAG with conf doesn't give
>> a native-ish experience as it creates DagRuns of the *triggered* dag
>> rather than *this* dag - which also means a single scheduler log file.
>>
>>
>>
>> *Suggested approach:*
>>
>> 1. User writes configuration files in JSON/YAML format. The schema can be
>> arbitrary except for one condition that it must have a *builder* parameter
>> with the path to a python file.
>>
>> 2. User writes the "builder" - a python file containing a make_dag method
>> that receives the parsed json/yaml and returns a DAG object. (Just a
>> sample strategy, we could instead say the file should contain a class that
>> extends an abstract DagBuilder class.)
>>
>> 3. Airflow reads JSON/YAML files from the dags directory as well. It
>> parses each file, imports the builder python file, passes the parsed
>> json/yaml to it, and collects the generated DAG into the DagBag.
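
To make the contract concrete, a config/builder pair might look like this
(a sketch of the proposal, not the actual patch; the config schema, the
builder path, and the returned plain dict are all illustrative stand-ins so
the snippet runs without an Airflow install -- a real builder would return
an airflow.DAG):

```python
# What a parsed YAML/JSON config might look like. Under this proposal the
# only required key is "builder", pointing at a Python file:
CONFIG = {
    "builder": "dags/builders/etl_builder.py",  # hypothetical path
    "dag_id": "daily_etl",
    "tables": ["orders", "customers"],
}


# Contents of the (hypothetical) builder file. The contract: expose a
# make_dag function that receives the parsed config and returns a DAG.
# A plain dict stands in for airflow.DAG here.
def make_dag(config: dict) -> dict:
    return {
        "dag_id": config["dag_id"],
        "task_ids": [f"load_{table}" for table in config["tables"]],
    }


# Airflow's side (step 3): after parsing the config file, import the
# builder module named in it (e.g. via
# importlib.util.spec_from_file_location), call make_dag, and collect the
# result into the DagBag. Called directly here for illustration:
dag = make_dag(CONFIG)
```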
>>
>>
>>
>> *Sample implementation:*
>>
>> See
>> https://github.com/siddharthvp/airflow/commit/47bad51fc4999737e9a300b134c04bbdbd04c88a;
>> only major code change is in dagbag.py
>>
>>
>>
>> *Result:*
>>
>> Dag file processor logs show yaml/json file (instead of the builder
>> python file). Each dynamically generated dag gets its own scheduler log
>> file.
>>
>> The configs dag_dir_list_interval, min_file_process_interval,
>> file_parsing_sort_mode all directly apply to dag config files.
>>
>> If the json/yaml fail to parse, it's registered as an import error.
>>
>>
>>
>> Would like to know your thoughts on this. Thanks!
>>
>> Siddharth VP
>>
>>
>>
>>
>> --
>>
>> +48 660 796 129
>>
>>
>>
>>
>> ==============================================================================
>> Please access the attached hyperlink for an important electronic
>> communications disclaimer:
>> http://www.credit-suisse.com/legal/en/disclaimer_email_ib.html
>>
>> ==============================================================================
>>
>>
>>
>> ==============================================================================
>> Please access the attached hyperlink for an important electronic
>> communications disclaimer:
>> http://www.credit-suisse.com/legal/en/disclaimer_email_ib.html
>>
>> ==============================================================================
>>
>>
>
