Hello everyone, I've been involved in a number of discussions recently on Slack, Stack Overflow, etc. (for example here: https://apache-airflow.slack.com/archives/CCQ7EGB1P/p1629809184065600) where new users of Airflow tried to use it as, basically, a kind of "MapReduce" framework as part of their DAG.
This has repeated itself quite a number of times, and I have explained over and over that Airflow is not that kind of system - I think I've done it 5 or 6 times already for different users. It made me think we should do something about it. I'm not sure what the best route is, so I am reaching out :).

Short description of the use case: a user has some data to process. They want to split the data into N pieces (or maybe it is already split), run N parallel, similar tasks, and do something with the result. The number N depends on some factors (size of the data? day of the week? whatever), but it changes dynamically between runs. One run can have 10 parallel, similar tasks and the next one 20.

My take: Airflow (currently) is not the kind of system that can handle this via the DAG structure (i.e. having such parallel tasks as separate tasks). That is what MapReduce-style frameworks do, and they are efficient at it, but conceptually Airflow should not change the number of tasks in its structure between runs. Usually Airflow can simply orchestrate such external systems, and that is my "default" answer.

There are two things we can do, I think:

1) Improve our docs a bit, mention that specific case, and direct users to some alternative approaches (tools) that Airflow can orchestrate. This is the only way we can address it short-term, I believe.

However, there is clearly a need for our users to do something like that as part of a "bigger" DAG. And while using an "external" system to do it is the most efficient and "recommended" way currently, maybe there is a class of problems like that where keeping those parallel tasks in Airflow MIGHT make sense. Airflow 2 already has a nice, efficient system for parallelising tasks, and it already has thousands of operators to do stuff, so there is a nice property in trying to use those capabilities for such "parallel" processing: you could do it without leaving the familiar Airflow ecosystem and Python, and without invoking any other "specialized" service. And I do not think it would be that difficult to imagine one task in Airflow actually running as N instances in parallel. We would not have to change the Airflow paradigm where the DAG structure is defined upfront during parsing; the structure would remain essentially the same - only instead of one task, we would invoke N parallel ones. There are some problems to solve, of course, but none of them are really huge, I think.

So maybe we can also do:

2) Implement support for such "task splitting" in Airflow.

I'd love to hear your thoughts about it. I have put a couple of rough sketches below my sign-off to make this a bit more concrete.

J.
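
P.S. To make the problem a bit more concrete: the pattern I keep seeing people attempt looks roughly like the sketch below (the DAG id, Variable key and callables are made up for illustration). The number of tasks is decided at parse time from external state, so the DAG shape changes between scheduler parses rather than per run - which is exactly what Airflow's model does not support well:

```python
# Rough sketch of the pattern users keep reaching for - NOT a recommendation.
# The number of tasks is derived from external state at *parse* time.
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="parse_time_fan_out",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # This top-level call runs on every parse of the file, so the DAG's
    # structure flips around whenever the Variable changes - independently
    # of any particular DAG run.
    num_chunks = int(Variable.get("num_chunks", default_var=10))

    for i in range(num_chunks):
        PythonOperator(
            task_id=f"process_chunk_{i}",
            python_callable=lambda i=i: print(f"processing chunk {i}"),
        )
```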

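And one possible shape of my "default answer" today - a minimal sketch (assuming Airflow 2's TaskFlow API; process_chunk and the chunking logic are placeholders) where the whole fan-out happens inside a single task, so the parsed DAG structure stays identical between runs even though N changes. The pool.map call is also the natural place to hand the work off to an external engine (Spark, Dask, ...) instead:

```python
# Minimal sketch: the "map" phase lives inside ONE Airflow task,
# so the parsed DAG structure never changes - only the work inside does.
from concurrent.futures import ProcessPoolExecutor
from datetime import datetime

from airflow.decorators import dag, task


def process_chunk(chunk: str) -> int:
    # Placeholder for the real per-chunk work.
    return len(chunk)


@dag(schedule_interval="@daily", start_date=datetime(2021, 1, 1), catchup=False)
def fan_out_inside_one_task():
    @task
    def split() -> list:
        # N is decided per run (hard-coded here; in reality it would be
        # derived from the input data, day of week, etc.).
        return [f"chunk-{i}" for i in range(10)]

    @task
    def process_all(chunks: list) -> list:
        # All N "parallel tasks" run inside this single Airflow task;
        # this is also where you could submit to Spark/Dask instead.
        with ProcessPoolExecutor() as pool:
            return list(pool.map(process_chunk, chunks))

    @task
    def combine(results: list):
        # The "reduce" step - do something with the partial results.
        print(sum(results))

    combine(process_all(split()))


the_dag = fan_out_inside_one_task()
```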