Re: [DISCUSS] Airflow as a Map-Reduce kind of framework misconception (or not?)

Kaxil Naik Wed, 25 Aug 2021 10:22:39 -0700

Oh, 100% -- it is a very common use-case and hopefully we will support it
natively soon.


Regards,
Kaxil

On Wed, Aug 25, 2021 at 4:08 PM Jarek Potiuk <[email protected]> wrote:

> Coincidentally I am also on vacation and should not be writing emails :).
>
> Cool. Sounds like the community is once again heading in the right direction.
>
> J.
>
> On Wed, 25 Aug 2021 at 16:41, Ash Berlin-Taylor <[email protected]> wrote:
>
>> That first line should have said: I'm on holiday this week (so I
>> shouldn't even be reading emails I guess) so sorry for the short response.
>>
>>
>> On 25 August 2021 15:31:27 BST, Ash Berlin-Taylor <[email protected]> wrote:
>>>
>>> I'm on holiday this week (so I shouldn't even be reading emails I guess).
>>>
>>> Such a feature was one of the things I hinted at in my keynote, as I
>>> think Airflow's "static" DAGs are going to limit the future growth and
>>> adoption of Airflow if we don't change it.
>>>
>>> The "canonical" example I use when talking about this workflow: say you
>>> have a sensor task which lists some files in an S3 bucket, and you want one
>>> downstream task for each file found. I firmly believe that this pattern
>>> belongs in Airflow.
>>>
>>> We (Daniel and I) are working on exactly such a task-splitting proposal
>>> (we've been calling it "dynamic task mapping", which is perhaps not the
>>> best name). As soon as AIP-39 lands and Airflow 2.2 is released we are
>>> going to start the AIP discussion process.
>>>
>>> Watch this space.
>>>
>>> Ash
>>>
>>> On 25 August 2021 15:07:32 BST, Jarek Potiuk <[email protected]> wrote:
>>>>
>>>> Hello everyone,
>>>>
>>>> I've been involved in a number of discussions recently on Slack, Stack
>>>> Overflow etc. (for example here)
>>>> https://apache-airflow.slack.com/archives/CCQ7EGB1P/p1629809184065600
>>>> where new users of Airflow tried to use it as basically a kind of
>>>> "MapReduce" framework as part of their DAG.
>>>>
>>>> This has repeated itself quite a number of times, and I have explained
>>>> over and over that Airflow is not that kind of system. I think I've done
>>>> that 5 or 6 times already with different users.
>>>>
>>>> It made me think we should do something about it. I'm not sure what the
>>>> best route is, so I am reaching out :).
>>>>
>>>> Short description of a use case:
>>>>
>>>> A user has some data to process. They want to split the data into N pieces
>>>> (or maybe it is already split), run N parallel, similar tasks and do
>>>> something with the results. The number N depends on some factors (size of
>>>> data? day of week? whatever), and it changes dynamically between
>>>> runs. One run can have 10 parallel similar tasks, and the next one 20.
>>>>
>>>> My take:
>>>>
>>>> Airflow (currently) is not the kind of system that can handle this using
>>>> the DAG structure (with such parallel tasks as separate tasks). That is
>>>> what MapReduce-style frameworks do, and they are efficient at it, but
>>>> Airflow conceptually should not change the number of tasks in its
>>>> structure between runs. Usually Airflow can simply orchestrate such
>>>> external systems, and that's my "default" answer.
>>>>
>>>> There are two things we can do, I think:
>>>>
>>>> 1) Improve our docs a bit and mention that specific case and direct
>>>> users to some alternative approaches (tools) that Airflow can orchestrate.
>>>> This is the only way we can address it short-term, I believe.
>>>>
>>>> However, there is clearly a need for our users to do something like
>>>> that as part of the "bigger" DAG. And while using an "external" system to
>>>> do it is the most efficient, and "recommended" way currently, maybe there
>>>> is a class of problems like that where keeping those parallel tasks in
>>>> Airflow MIGHT make sense. Airflow 2 already has a nice, efficient system of
>>>> parallelising tasks and it already has thousands of operators to do stuff,
>>>> so there is real appeal in using those capabilities for such
>>>> "parallel" processing. You could do it without leaving the familiar
>>>> "airflow" ecosystem and Python, and without invoking any other
>>>> "specialized" service.
>>>>
>>>> And I think it is not that difficult to imagine that one task in
>>>> Airflow could actually run as N instances in parallel. We would not have to
>>>> change the paradigm of Airflow where the DAG structure should be defined
>>>> upfront during parsing. The structure would remain essentially the same -
>>>> only instead of one task, we would invoke N parallel ones. There are some
>>>> problems to solve, of course, but none of them are really huge I think.
>>>>
>>>> So maybe we can also do
>>>>
>>>> 2) implement support for such "task splitting" in Airflow.
>>>>
>>>> I'd love to hear your thoughts about it.
>>>>
>>>> J.
>>>>
>>>>
>>>>
>>>>
