Hello everyone,

I've been involved in a number of discussions recently on Slack/Stack
Overflow etc. (for example here:
https://apache-airflow.slack.com/archives/CCQ7EGB1P/p1629809184065600),
where new users of Airflow tried to use it essentially as a kind of
"MapReduce" framework as part of their DAG.

This has repeated itself quite a number of times, and I have explained over
and over that Airflow is not that kind of system. I think I've done that 5
or 6 times already for different users.

It made me think we should do something about it. I'm not sure what the
best route is, so I am reaching out :).

Short description of a use case:

A user has some data to process. They want to split the data into N pieces
(or maybe it is already split), run N parallel, similar tasks, and do
something with the results. The number N depends on some factors (size of
data? day of week? whatever), but it changes dynamically between runs: one
run can have 10 parallel, similar tasks and the next one 20.
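
To make the pattern concrete, here is a minimal sketch (my own
illustration, not taken from any of the linked threads) of what users
typically try today - a loop at DAG-parse time. It only works if N is
known when the file is parsed, which is exactly the limitation they run
into:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    N = 10  # users want this to depend on each run's data,
            # but here it is fixed when the DAG file is parsed

    def process_chunk(chunk_id):
        print(f"processing chunk {chunk_id}")

    with DAG(
        dag_id="parallel_chunks",
        start_date=datetime(2021, 1, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        for i in range(N):
            PythonOperator(
                task_id=f"process_chunk_{i}",
                python_callable=process_chunk,
                op_kwargs={"chunk_id": i},
            )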

My take:

Airflow (currently) is not the kind of system that can handle this via the
DAG structure (i.e. having such parallel tasks as separate tasks). That is
what MapReduce-style frameworks do, and they are efficient at it, but
conceptually Airflow should not change the number of tasks in its structure
between runs. Usually Airflow can simply orchestrate such external systems,
and that's my "default" answer.
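
For completeness, that "default" answer usually looks something like this -
hand the map/reduce part to an external engine and let Airflow just trigger
and monitor it. A rough sketch using the Spark provider (the application
path and connection id below are made up for illustration):

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.apache.spark.operators.spark_submit import (
        SparkSubmitOperator,
    )

    with DAG(
        dag_id="orchestrate_external_processing",
        start_date=datetime(2021, 1, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        # Spark (or Dask, Beam, ...) decides internally how many parallel
        # workers to use; Airflow only schedules and monitors the job.
        process_all = SparkSubmitOperator(
            task_id="process_all_chunks",
            application="/opt/jobs/process_chunks.py",  # made-up path
            conn_id="spark_default",
        )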

There are two things we can do, I think:

1) Improve our docs a bit: mention that specific case and direct users to
some alternative approaches (tools) that Airflow can orchestrate. This is
the only way we can address it short-term, I believe.

However, there is clearly a need for our users to do something like that as
part of a "bigger" DAG. And while using an "external" system for it is
currently the most efficient and "recommended" way, maybe there is a class
of problems like that where keeping those parallel tasks in Airflow MIGHT
make sense. Airflow 2 already has a nice, efficient system for parallelising
tasks, and it already has thousands of operators to do stuff, so there is an
appealing property in trying to use those capabilities for such "parallel"
processing: you could do it without leaving the familiar Airflow ecosystem
and Python, and without invoking any other "specialized" service.

And actually, I think it would not be that difficult to imagine one task in
Airflow running as N parallel instances. We would not have to change
Airflow's paradigm where the DAG structure is defined upfront during
parsing. The structure would remain essentially the same - only instead of
one task, we would invoke N parallel instances of it. There are some
problems to solve - of course - but none of them are really huge, I think.

So maybe we can also do:

2) Implement support for such "task splitting" in Airflow.
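
To be clear about what I mean by "task splitting", here is the shape of the
computation in plain Python (this is NOT Airflow code or a proposed API,
just an analogue): one "process" definition, N parallel invocations, with N
only known at run time - while the split -> process -> merge structure
itself never changes:

    from concurrent.futures import ProcessPoolExecutor

    def split(data, chunk_size):
        # How many chunks there are depends on the data, so it is only
        # known at run time - 10 chunks today, 20 tomorrow.
        return [data[i:i + chunk_size]
                for i in range(0, len(data), chunk_size)]

    def process(chunk):
        return sum(chunk)

    def merge(results):
        return sum(results)

    if __name__ == "__main__":
        chunks = split(list(range(100)), chunk_size=10)
        # One "process" definition, N parallel instances of it.
        with ProcessPoolExecutor() as pool:
            results = list(pool.map(process, chunks))
        print(merge(results))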

I'd love to hear your thoughts about it.

J.
