potiuk commented on issue #37119:
URL: https://github.com/apache/airflow/issues/37119#issuecomment-1920070412
No. Same as in the discussion you quoted - if you want to generate a
sequential list of tasks to execute at runtime, there is no such feature in
Airflow - not with Airflow's definition of tasks. If you want N sequentially
executed, independent tasks, the number of them and the dependencies between
them (i.e. the DAG structure) MUST be set at parsing time. One of the reasons
is that if you could dynamically create such tasks, you would need to
dynamically create dependencies, and that might mean, for example, that your
DAG stops being a DAG - it could dynamically become a graph with a cycle.
That's why you cannot add dependencies dynamically: the whole DAG graph must be
resolved before the scheduler starts scheduling the tasks, because it has to
calculate the dependencies.
What you can do, however (since you do not want to use the parallelism feature
of Airflow and distribute such sequential tasks among different nodes) - you
can write your own "sequential execution task" that will use the EMR hook and
simply execute your tasks in a loop, one by one. That loop can be arbitrarily
long and dynamic.
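A minimal sketch of such a sequential runner (the `submit_and_wait` callable and the step/state shapes here are assumptions for illustration, not the actual EMR provider API - wire it to the real hook yourself):

```python
# Sketch of the body of a single "sequential execution task".
# `submit_and_wait` is a hypothetical callable standing in for your EMR hook
# call (e.g. submit a step, then poll until it finishes and return its state).

def run_steps_sequentially(steps, submit_and_wait):
    """Run each step one by one, stopping at the first failure."""
    results = []
    for step in steps:
        state = submit_and_wait(step)  # blocks until this step is done
        results.append((step, state))
        if state != "COMPLETED":
            raise RuntimeError(f"Step {step!r} ended in state {state}")
    return results
```

Wrapped in a single `PythonOperator` (or `@task`), this gives you an arbitrarily long, dynamic sequence - at the cost of per-step visibility and retries, as noted below.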
In this case you will not get UI visualisation, retries, partial reruns,
clearing and the like. But you will get the basic "N tasks executed in
sequence" behaviour.
You can also emulate it a bit by assigning a 1-slot pool to all tasks in a
group, where your tasks will be competing for the pool slot. But this does not
guarantee the sequence of execution: all your dynamically mapped tasks will
technically still be running in parallel, but with parallelism = 1 - which
means one at a time, but in an undefined (random) sequence.
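For illustration, attaching such a pool to dynamically mapped tasks looks roughly like this (assuming a 1-slot pool named `single_slot` has already been created via the UI or `airflow pools set`; the task body is a placeholder):

```python
from airflow.decorators import dag, task


@dag(schedule=None)
def pooled_steps():
    @task(pool="single_slot")  # 1-slot pool, assumed to exist already
    def run_step(step: str):
        ...  # your EMR call for one step

    # All mapped instances compete for the single slot: they run one at a
    # time, but in no guaranteed order.
    run_step.expand(step=["step-1", "step-2", "step-3"])


pooled_steps()
```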
If you also relax your "runtime" expectation to less-than-runtime (i.e., for
example, the DAG changes for all runs between the times your YAML file changes
- let's say once a day or once a week), then you could generate your DAG from
such a YAML file using Dynamic DAG generation, not Dynamic Task Mapping. There
you simply create tasks and set the dependencies between them in the Python
code when your file is parsed. Roughly:
```python
@dag
def my_dag():
    config = read_yaml()  # placeholder: load your YAML file at parse time
    previous_task = None
    for task_conf in config.tasks:
        current = Emr(...)  # placeholder operator, one per YAML entry
        if previous_task is not None:
            previous_task >> current
        previous_task = current
```
This is the absolutely classic way of generating DAGs, explained in our docs,
even with an explicit YAML-file example:
https://airflow.apache.org/docs/apache-airflow/stable/howto/dynamic-dag-generation.html#dynamic-dags-with-external-configuration-from-a-structured-data-file
But then, if it changes often and wildly, it's not a good fit. Such a DAG
should change slowly - far less frequently (by orders of magnitude) than the
frequency of DAG runs.
And yes, @nathadfield's suggestion holds whatever you do (if you decide to use
Airflow for this - somewhat niche in the Airflow world - case of setting up
and tearing down your cluster).
Those are, I believe, all the options you have in Airflow right now.
But as usual for those who have niche cases: if you figure out a mental model
in which this can be generic enough and implementable, proposals are welcome.
My feeling is that a use case of this caliber somewhat calls for preparing an
Airflow Improvement Proposal, because (if you want to stick to runtime
properties) it calls for a feature that would allow a subset of DAG-structure
modifications that do not change the properties of the graph - for example,
just expanding a linear graph branch by injecting new tasks into that branch.
And BTW, I am converting this into a discussion. This is not an issue - it is
a discussion of a niche case you have that is likely not necessarily a good
fit for Airflow right now.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]