Re: [DISCUSS] Airflow as a Map-Reduce kind of framework misconception (or not?)

Jens Larsson Fri, 10 Sep 2021 01:20:32 -0700

I think there are some operators in particular that could really benefit
from this. One that comes to mind is the cassandra_to_gcs_operator
<https://airflow.apache.org/docs/apache-airflow-providers-google/stable/_modules/airflow/providers/google/cloud/transfers/cassandra_to_gcs.html>
,
as Cassandra is generally used for Terabyte+ data volumes, and are not
performant for full table scans (optimized for key lookup).


Hence it makes sense to break the job into small pieces - as one large
multi-hour job is likely to crash and lose all it's progress. We have tasks
similar to this that run for 10+ hours and often fail and have to restart.

I wouldn't think it's a good idea to generate hundreds, or thousands, of
tasks in the DAG, but somehow make the single task Operator itself create
several executions - if e.g. the KubernetesExecutor is used, it could
create >1 pod for the same task, and initialize them with some split
parameter.

/Jens



On Wed, Aug 25, 2021 at 7:22 PM Kaxil Naik <[email protected]> wrote:

> Oh, 100% -- it is a very common use-case and hopefully we will support it
> natively soon.
>
> Regards,
> Kaxil
>
> On Wed, Aug 25, 2021 at 4:08 PM Jarek Potiuk <[email protected]> wrote:
>
>> Coincidentally I am also on vacation and should not be writing emails :).
>>
>> Cool. Sounds like again community is heading in the right direction.
>>
>> J.
>>
>> śr., 25 sie 2021, 16:41 użytkownik Ash Berlin-Taylor <[email protected]>
>> napisał:
>>
>>> That first line should have said: I'm on holiday this week (so I
>>> shouldn't even be reading emails I guess) so sorry for the short response.
>>>
>>>
>>> On 25 August 2021 15:31:27 BST, Ash Berlin-Taylor <[email protected]>
>>> wrote:
>>>>
>>>> I'm on holiday this week (so I shouldn't even be reading emails I
>>>> guess).
>>>>
>>>> Such a feature was one of the things I hinted at in my Keynote as I
>>>> think Airflow's "static" dags area going to limit the future growth and
>>>> adoption of Airflow if we don't change it.
>>>>
>>>> The "canonical" example I use when taking about this workflow: say your
>>>> have a sensor task which lists some files in an S3 bucket, and you want one
>>>> downstream task for each file found - I firmly believe that this pattern
>>>> belongs in Airflow.
>>>>
>>>> We (Daniel and I) are working on exactly such a Task splitting proposal
>>>> (we've been calling it "dynamic task mapping" which is perhaps not the next
>>>> name.) As soon as AIP-39 lands and Airflow 2.2 is released we are going to
>>>> start the AIP discussion process.
>>>>
>>>> Watch this space.
>>>>
>>>> Ash
>>>>
>>>> On 25 August 2021 15:07:32 BST, Jarek Potiuk <[email protected]> wrote:
>>>>>
>>>>> Hello everyone,
>>>>>
>>>>> I've been involved in a number of discussions recently on slack/stack
>>>>> overflow etc. (for example here)
>>>>> https://apache-airflow.slack.com/archives/CCQ7EGB1P/p1629809184065600
>>>>> where new users of Airflow tried to use it as basically a kind of
>>>>> "MapReduce" framework as part of their DAG.
>>>>>
>>>>> This repeated itself quite a number of times, and I explained over and
>>>>> over that Airflow is not the kind of system. I think I've done that 5 or 6
>>>>> times already to different users.
>>>>>
>>>>> It made me think we should do something about it. Not sure what is the
>>>>> best route so I am reaching out :).
>>>>>
>>>>> Short description of a use case:
>>>>>
>>>>> User has some data to process. They want to split the data in N pieces
>>>>> (or maybe it is already split), run N parallel, similar tasks and do
>>>>> something with the result. The "N" number depends on some factors (Size of
>>>>> data? Day of week ? whatever). But it changes dynamically between 
>>>>> different
>>>>> runs. One run can have 10 parallel similar tasks, and the next one 20.
>>>>>
>>>>> My take:
>>>>>
>>>>> Airflow (currently) is not the kind of system that can handle it using
>>>>> DAG structure (And having such parallel tasks as separate tasks). That is
>>>>> what MapReduce kind of frameworks do and are efficient in that, but 
>>>>> Airflow
>>>>> conceptually should not change a number of tasks in it's structiure
>>>>> between runs. Usually Airflow can simply orchestrate such external 
>>>>> systems,
>>>>> and that's my "default" answer.
>>>>>
>>>>> There are two things we can do, I think:
>>>>>
>>>>> 1) Improve our docs a bit and mention that specific case and direct
>>>>> users to some alternative approaches (tools) that Airflow can orchestrate.
>>>>> This is the only way we can address it short-term, I believe.
>>>>>
>>>>> However, there is clearly a need for our users to do something like
>>>>> that as part of the "bigger" DAG. And while using an "external" system to
>>>>> do it is the most efficient, and "recommended" way currently, maybe there
>>>>> is a class of problems like that where keeping those parallel tasks in
>>>>> Airflow MIGHT make sense. Airflow 2 already has a nice, efficient system 
>>>>> of
>>>>> parallelising tasks and it already has thousands of operators to do stuff,
>>>>> so there is a nice property of trying to use those capabilities for such
>>>>> "parallel" processing. You could do it without leaving the familiar
>>>>> "airflow" ecosystem and Python without invoking any other "specialized"
>>>>> service.
>>>>>
>>>>> And I think it would not be as difficult to imagine that one task in
>>>>> Airflow can run in N instances in parallel actually. We would not have to
>>>>> change the paradigm of Airflow where DAG structure should be defined
>>>>> upfront during parsing. The structure would remain essentially the same -
>>>>> only instead of one task, we would invoke N parallel ones. There are some
>>>>> problems to solve - of course - but none of them are really huge I think.
>>>>>
>>>>> So maybe we can also do
>>>>>
>>>>> 2) implement support for such "task splitting" in Airflow.
>>>>>
>>>>> I'd love to hear your thoughts about it.
>>>>>
>>>>> J.
>>>>>
>>>>>
>>>>>
>>>>>

-- 

Jens Larsson

Head of Data & Analytics

+46 70 269 00 89

[email protected]



Tink AB

Vasagatan 11

111 20 Stockholm, Sweden

tink.com

T&Cs & Privacy Policies <https://business.tink.se/our-privacy-policies>

Re: [DISCUSS] Airflow as a Map-Reduce kind of framework misconception (or not?)

Reply via email to