Coincidentally, I am also on vacation and should not be writing emails :). Cool. Sounds like the community is once again heading in the right direction.
J.

On Wed, 25 Aug 2021, 16:41, Ash Berlin-Taylor <[email protected]> wrote:

> That first line should have said: I'm on holiday this week (so I shouldn't
> even be reading emails I guess), so sorry for the short response.
>
> On 25 August 2021 15:31:27 BST, Ash Berlin-Taylor <[email protected]> wrote:
>>
>> I'm on holiday this week (so I shouldn't even be reading emails I guess).
>>
>> Such a feature was one of the things I hinted at in my keynote, as I think
>> Airflow's "static" DAGs are going to limit the future growth and adoption
>> of Airflow if we don't change it.
>>
>> The "canonical" example I use when talking about this workflow: say you
>> have a sensor task which lists some files in an S3 bucket, and you want
>> one downstream task for each file found - I firmly believe that this
>> pattern belongs in Airflow.
>>
>> We (Daniel and I) are working on exactly such a task-splitting proposal
>> (we've been calling it "dynamic task mapping", which is perhaps not the
>> best name). As soon as AIP-39 lands and Airflow 2.2 is released, we are
>> going to start the AIP discussion process.
>>
>> Watch this space.
>>
>> Ash
>>
>> On 25 August 2021 15:07:32 BST, Jarek Potiuk <[email protected]> wrote:
>>>
>>> Hello everyone,
>>>
>>> I've been involved in a number of discussions recently on Slack, Stack
>>> Overflow, etc. (for example here:
>>> https://apache-airflow.slack.com/archives/CCQ7EGB1P/p1629809184065600)
>>> where new users of Airflow tried to use it as basically a kind of
>>> "MapReduce" framework as part of their DAG.
>>>
>>> This has repeated itself quite a number of times, and I have explained
>>> over and over that Airflow is not that kind of system. I think I've done
>>> that 5 or 6 times already for different users.
>>>
>>> It made me think we should do something about it. I'm not sure what the
>>> best route is, so I am reaching out :).
>>>
>>> Short description of the use case:
>>>
>>> A user has some data to process. They want to split the data into N
>>> pieces (or maybe it is already split), run N parallel, similar tasks,
>>> and do something with the result. The "N" depends on some factors (the
>>> size of the data? the day of the week? whatever), and it changes
>>> dynamically between runs. One run can have 10 parallel similar tasks,
>>> and the next one 20.
>>>
>>> My take:
>>>
>>> Airflow (currently) is not the kind of system that can handle this in
>>> the DAG structure (i.e. having such parallel tasks as separate tasks).
>>> That is what MapReduce-style frameworks do, and they are efficient at
>>> it, but Airflow conceptually should not change the number of tasks in
>>> its structure between runs. Usually Airflow can simply orchestrate such
>>> external systems, and that's my "default" answer.
>>>
>>> There are two things we can do, I think:
>>>
>>> 1) Improve our docs a bit, mention that specific case, and direct users
>>> to some alternative approaches (tools) that Airflow can orchestrate.
>>> This is the only way we can address it short-term, I believe.
>>>
>>> However, there is clearly a need for our users to do something like this
>>> as part of a "bigger" DAG. And while using an "external" system to do it
>>> is the most efficient and "recommended" way currently, maybe there is a
>>> class of problems like this where keeping those parallel tasks in
>>> Airflow MIGHT make sense.
>>> Airflow 2 already has a nice, efficient system for parallelising tasks,
>>> and it already has thousands of operators to do stuff, so there is a
>>> nice property in trying to use those capabilities for such "parallel"
>>> processing: you could do it without leaving the familiar Airflow
>>> ecosystem and Python, and without invoking any other "specialized"
>>> service.
>>>
>>> And I think it would not be that difficult to imagine one task in
>>> Airflow actually running as N instances in parallel. We would not have
>>> to change the paradigm of Airflow where the DAG structure is defined
>>> upfront during parsing. The structure would remain essentially the same
>>> - only instead of one task, we would invoke N parallel ones. There are
>>> some problems to solve - of course - but none of them are really huge,
>>> I think.
>>>
>>> So maybe we can also do:
>>>
>>> 2) Implement support for such "task splitting" in Airflow.
>>>
>>> I'd love to hear your thoughts about it.
>>>
>>> J.
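
To make the limitation discussed in the thread concrete: in Airflow 2 as it stood at the time, a fan-out like Jarek describes has to be written with its width fixed at parse time, so N cannot react to anything discovered during a run. Below is a minimal sketch of that workaround; the task names and the choice of N = 5 are illustrative assumptions, not something from the thread.

    from datetime import datetime

    from airflow.decorators import dag, task

    # Fixed when the DAG file is parsed; a different N means a
    # different DAG structure.
    N = 5


    @dag(schedule_interval=None, start_date=datetime(2021, 8, 1), catchup=False)
    def static_fan_out():
        @task
        def process(chunk_id: int):
            # Placeholder for the per-chunk work.
            print(f"processing chunk {chunk_id}")

        @task
        def combine():
            # Placeholder for the "do something with the result" step.
            print("combining results")

        done = combine()
        for i in range(N):
            process(i) >> done


    static_fan_out_dag = static_fan_out()

Changing N here changes the parsed DAG structure itself, which is exactly the "static" DAGs constraint Ash refers to in his reply.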
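Jarek's "default" answer - orchestrating an external system that is built for this kind of dynamic parallelism - might look like the sketch below, using the SparkSubmitOperator from the apache-spark provider package. The application path, task id, and connection id are hypothetical.

    from datetime import datetime

    from airflow.decorators import dag
    from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator


    @dag(schedule_interval=None, start_date=datetime(2021, 8, 1), catchup=False)
    def delegate_to_spark():
        # The dynamic split/process/merge happens entirely inside Spark,
        # which is built for it; Airflow only schedules and monitors one
        # task, so the DAG structure never changes between runs.
        SparkSubmitOperator(
            task_id="process_all_chunks",
            application="/opt/jobs/process_chunks.py",  # hypothetical job script
            conn_id="spark_default",
        )


    delegate_to_spark_dag = delegate_to_spark()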
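For context on option 2): the proposal Ash describes was later formalized as AIP-42 (Dynamic Task Mapping) and shipped in Airflow 2.3. Below is a minimal sketch of the canonical S3 example using that API, with the bucket listing stubbed out and the file names invented for illustration.

    from datetime import datetime

    from airflow.decorators import dag, task


    @dag(schedule_interval=None, start_date=datetime(2022, 1, 1), catchup=False)
    def s3_fan_out():
        @task
        def list_files():
            # Stand-in for the sensor that lists files in an S3 bucket;
            # in practice this could use S3Hook.list_keys().
            return ["a.csv", "b.csv", "c.csv"]

        @task
        def process(key: str):
            print(f"processing {key}")

        # expand() creates one mapped task instance per returned key at
        # run time, so the number of parallel tasks can differ between
        # runs while the parsed DAG structure stays the same.
        process.expand(key=list_files())


    s3_fan_out_dag = s3_fan_out()

The number of mapped process instances is decided at run time from the value returned by list_files, while the parsed DAG still contains exactly two task definitions - precisely the "structure stays the same" property Jarek asks for.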
