I'm on holiday this week (so I shouldn't even be reading emails, I guess). Such a feature was one of the things I hinted at in my Keynote, as I think Airflow's "static" DAGs are going to limit the future growth and adoption of Airflow if we don't change it.

The "canonical" example I use when talking about this workflow: say you have a sensor task which lists some files in an S3 bucket, and you want one downstream task for each file found. I firmly believe that this pattern belongs in Airflow.

We (Daniel and I) are working on exactly such a task splitting proposal (we've been calling it "dynamic task mapping", which is perhaps not the best name). As soon as AIP-39 lands and Airflow 2.2 is released we are going to start the AIP discussion process.

Watch this space.

Ash

On 25 August 2021 15:07:32 BST, Jarek Potiuk <[email protected]> wrote:
>Hello everyone,
>
>I've been involved in a number of discussions recently on Slack/Stack
>Overflow etc. (for example here)
>https://apache-airflow.slack.com/archives/CCQ7EGB1P/p1629809184065600 where
>new users of Airflow tried to use it as basically a kind of "MapReduce"
>framework as part of their DAG.
>
>This repeated itself quite a number of times, and I explained over and over
>that Airflow is not that kind of system. I think I've done that 5 or 6 times
>already to different users.
>
>It made me think we should do something about it. Not sure what is the best
>route so I am reaching out :).
>
>Short description of a use case:
>
>User has some data to process. They want to split the data in N pieces (or
>maybe it is already split), run N parallel, similar tasks and do something
>with the result. The "N" number depends on some factors (Size of data? Day
>of week? Whatever). But it changes dynamically between different runs. One
>run can have 10 parallel similar tasks, and the next one 20.
>
>My take:
>
>Airflow (currently) is not the kind of system that can handle it using DAG
>structure (and having such parallel tasks as separate tasks). That is what
>MapReduce kind of frameworks do and are efficient in that, but Airflow
>conceptually should not change the number of tasks in its structure
>between runs. Usually Airflow can simply orchestrate such external systems,
>and that's my "default" answer.
>
>There are two things we can do, I think:
>
>1) Improve our docs a bit and mention that specific case and direct users
>to some alternative approaches (tools) that Airflow can orchestrate. This
>is the only way we can address it short-term, I believe.
>
>However, there is clearly a need for our users to do something like that as
>part of the "bigger" DAG. And while using an "external" system to do it is
>the most efficient, and "recommended" way currently, maybe there is a class
>of problems like that where keeping those parallel tasks in Airflow MIGHT
>make sense. Airflow 2 already has a nice, efficient system of parallelising
>tasks and it already has thousands of operators to do stuff, so there is a
>nice property of trying to use those capabilities for such "parallel"
>processing. You could do it without leaving the familiar "airflow"
>ecosystem and Python without invoking any other "specialized" service.
>
>And I think it would not be as difficult to imagine that one task in
>Airflow can run in N instances in parallel actually. We would not have to
>change the paradigm of Airflow where DAG structure should be defined
>upfront during parsing. The structure would remain essentially the same -
>only instead of one task, we would invoke N parallel ones. There are some
>problems to solve - of course - but none of them are really huge I think.
>
>So maybe we can also do
>
>2) implement support for such "task splitting" in Airflow.
>
>I'd love to hear your thoughts about it.
>
>J.
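
For readers who want the use case in this thread made concrete, here is a minimal plain-Python sketch of the pattern: list some files, run one similar task per file, then combine the results. It is not Airflow code and not the proposed design. The names `list_files`, `process_file`, and `run`, the in-memory "bucket", and the use of a thread pool are all assumptions made purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def list_files(run_date: str) -> List[str]:
    """Hypothetical stand-in for the sensor: the file count differs per run."""
    fake_bucket = {
        "2021-08-24": ["a.csv", "b.csv"],
        "2021-08-25": ["a.csv", "b.csv", "c.csv"],
    }
    return fake_bucket.get(run_date, [])


def process_file(name: str) -> int:
    """Hypothetical per-file task; here it only measures the file name."""
    return len(name)


def run(run_date: str) -> int:
    files = list_files(run_date)  # N is only known once the run has started
    if not files:
        return 0  # nothing found, so no parallel work is scheduled
    # "Map": one call of the same task per file, executed in parallel.
    with ThreadPoolExecutor(max_workers=len(files)) as pool:
        results = list(pool.map(process_file, files))
    # "Reduce": do something with the combined results.
    return sum(results)


if __name__ == "__main__":
    for day in ("2021-08-24", "2021-08-25"):
        print(day, len(list_files(day)), "tasks ->", run(day))
```

The shape of the code is fixed (list, then map, then combine) while the number of parallel calls changes between runs, which is the property both messages describe: the structure stays the same and only the count N varies.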
