Re: [DISCUSS] Airflow as a Map-Reduce kind of framework misconception (or not?)

Ash Berlin-Taylor Wed, 25 Aug 2021 07:41:47 -0700

That first line should have said: I'm on holiday this week (so I shouldn't even 
be reading emails I guess) so sorry for the short response.



On 25 August 2021 15:31:27 BST, Ash Berlin-Taylor <[email protected]> wrote:
>I'm on holiday this week (so I shouldn't even be reading emails I guess).
>
>Such a feature was one of the things I hinted at in my keynote, as I think
>Airflow's "static" DAGs are going to limit the future growth and adoption of
>Airflow if we don't change that.
>
>The "canonical" example I use when talking about this workflow: say you have a
>sensor task which lists some files in an S3 bucket, and you want one
>downstream task for each file found. I firmly believe that this pattern
>belongs in Airflow.
>
>We (Daniel and I) are working on exactly such a task-splitting proposal (we've
>been calling it "dynamic task mapping", which is perhaps not the best name). As
>soon as AIP-39 lands and Airflow 2.2 is released we are going to start the AIP
>discussion process.
>
>Watch this space.
>
>Ash
>
>On 25 August 2021 15:07:32 BST, Jarek Potiuk <[email protected]> wrote:
>>Hello everyone,
>>
>>I've been involved in a number of discussions recently on Slack, Stack
>>Overflow etc. (for example here:
>>https://apache-airflow.slack.com/archives/CCQ7EGB1P/p1629809184065600) where
>>new users of Airflow tried to use it as basically a kind of "MapReduce"
>>framework as part of their DAG.
>>
>>This repeated itself quite a number of times, and I explained over and over
>>that Airflow is not that kind of system. I think I've done that 5 or 6 times
>>already to different users.
>>
>>It made me think we should do something about it. I'm not sure what the best
>>route is, so I am reaching out :).
>>
>>Short description of a use case:
>>
>>User has some data to process. They want to split the data into N pieces (or
>>maybe it is already split), run N parallel, similar tasks and do something
>>with the results. N depends on some factors (size of data, day of week,
>>whatever) and changes dynamically between runs: one run can have 10
>>parallel similar tasks, and the next one 20.
>>
>>My take:
>>
>>Airflow (currently) is not the kind of system that can handle this using DAG
>>structure (with such parallel tasks as separate tasks). That is what
>>MapReduce-style frameworks do, and they are efficient at it, but Airflow
>>conceptually should not change the number of tasks in its structure
>>between runs. Usually Airflow can simply orchestrate such external systems,
>>and that's my "default" answer.
>>
>>There are two things we can do, I think:
>>
>>1) Improve our docs a bit and mention that specific case and direct users
>>to some alternative approaches (tools) that Airflow can orchestrate. This
>>is the only way we can address it short-term, I believe.
>>
>>However, there is clearly a need for our users to do something like that as
>>part of the "bigger" DAG. And while using an "external" system to do it is
>>the most efficient, and "recommended" way currently, maybe there is a class
>>of problems like that where keeping those parallel tasks in Airflow MIGHT
>>make sense. Airflow 2 already has a nice, efficient system of parallelising
>>tasks and it already has thousands of operators to do stuff, so there is a
>>nice property of trying to use those capabilities for such "parallel"
>>processing. You could do it without leaving the familiar "airflow"
>>ecosystem and Python without invoking any other "specialized" service.
>>
>>And I think it is not that difficult to imagine one task in Airflow
>>actually running as N instances in parallel. We would not have to
>>change the paradigm of Airflow where DAG structure should be defined
>>upfront during parsing. The structure would remain essentially the same;
>>only instead of one task, we would invoke N parallel ones. There are some
>>problems to solve, of course, but none of them are really huge, I think.
>>
>>So maybe we can also do
>>
>>2) implement support for such "task splitting" in Airflow.
>>
>>I'd love to hear your thoughts about it.
>>
>>J.
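
For readers who want the use case from the quoted message in concrete form, here is a minimal sketch in plain Python. It is not Airflow code and does not show how Airflow does or would implement task splitting; the names `split`, `process_piece` and `run` are hypothetical, and a thread pool stands in for whatever would actually execute the parallel task instances. The point is only that the "structure" (split, one task definition, combine) stays fixed while N changes between runs.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, n):
    """Split data into n contiguous pieces of nearly equal size."""
    size, extra = divmod(len(data), n)
    pieces, start = [], 0
    for i in range(n):
        # The first `extra` pieces get one additional element each.
        end = start + size + (1 if i < extra else 0)
        pieces.append(data[start:end])
        start = end
    return pieces


def process_piece(piece):
    """The single task definition; it is invoked once per piece."""
    return sum(piece)


def run(data, n):
    """One 'run': split, then N parallel copies of one task, then combine."""
    pieces = split(data, n)
    with ThreadPoolExecutor(max_workers=n) as pool:
        partials = list(pool.map(process_piece, pieces))
    return sum(partials)


if __name__ == "__main__":
    data = list(range(100))
    # Same structure, different N on each run.
    print(run(data, 10))
    print(run(data, 20))
```

Both calls go through the same three steps; only the number of parallel `process_piece` invocations differs, which is the property described above as "instead of one task, we would invoke N parallel ones".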
