RE: [PROPOSAL] Add streaming support to PartialOperator

Blain David Wed, 18 Sep 2024 10:35:11 -0700

The example I gave was a simplified version, and also a continuation of another 
DAG process.


The issue I tried to solve in Airflow for this case (we have other use cases 
where we ran into the same issue) was reading n users from MSGraph that were 
updated and had to be synchronized to our data warehouse.

The problem is that for each user, we then also need to update the groups they 
belong to, their devices, their licenses, and so on.

Unfortunately, each of those 3 things needs a dedicated MSGraph call per user. 
You can't get this information in one call, nor combined with the updated users 
call; you have to do it all individually.

So in the above example you get 3 additional MSGraph calls per updated user. 
With 1k updated users, that means 3k dynamic tasks.
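
To make the fan-out concrete, here is a rough sketch in plain Python (no 
Airflow imports). The resource names and the `fan_out` helper are made up for 
illustration; only the ratio of 3 calls per user comes from the case described 
above.

```python
from itertools import product

# Per-user resources that each need a dedicated MSGraph call
# (names are illustrative, not actual endpoint names).
RESOURCES = ("groups", "devices", "licenses")


def fan_out(user_ids):
    """Return one (user_id, resource) pair per call that would be needed."""
    return list(product(user_ids, RESOURCES))


calls = fan_out(range(1000))
print(len(calls))  # 1000 users * 3 resources = 3000 calls
```

If every pair is mapped to its own task instance, 1k updated users become 3k 
task instances for a single run.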

My first approach was using dynamic tasks, but that exploded very quickly as 
explained above, since each updated user triggers 3 calls and users get updated 
frequently.
For example, a changed permission/role for a user triggers an update; with 
70k+ users, it can grow quickly.

The original job runs as custom Python code using RxPy 
(https://github.com/ReactiveX/RxPY), which follows the reactive programming 
methodology,
but we want to step away from it because everything in it is custom: invoking 
MSGraph, writing to the database, and also the CI/CD involved in maintaining 
the project.
We want to move away from custom code and have native Airflow jobs, and I 
personally think this case is perfectly possible in Airflow, at least if we 
have the "streaming" option, which I use now and which works fine.
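
As a rough illustration of the general idea of consuming results lazily rather 
than creating one task per call, here is a minimal sketch in plain Python. It 
is not the PartialOperator implementation; `stream_user_details` and 
`fake_fetch` are hypothetical names, and `fake_fetch` merely stands in for a 
real MSGraph request.

```python
from typing import Callable, Iterable, Iterator

# Illustrative resource names, one dedicated call each per user.
RESOURCES = ("groups", "devices", "licenses")


def stream_user_details(
    user_ids: Iterable[str],
    fetch: Callable[[str, str], list],
) -> Iterator[dict]:
    """Lazily yield one record per user, making the 3 calls in one process."""
    for user_id in user_ids:
        yield {"id": user_id, **{r: fetch(user_id, r) for r in RESOURCES}}


def fake_fetch(user_id: str, resource: str) -> list:
    # Stand-in for a real MSGraph request; returns dummy data.
    return [f"{resource}-of-{user_id}"]


for record in stream_user_details(["u1", "u2"], fake_fetch):
    print(record)
```

Because the generator yields one user at a time, a downstream consumer can 
start writing records before all users have been fetched.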

-----Original Message-----
From: Daniel Standish <daniel.stand...@astronomer.io.INVALID> 
Sent: Wednesday, September 18, 2024 6:41 PM
To: dev@airflow.apache.org
Subject: Re: [PROPOSAL] Add streaming support to PartialOperator


Curious why you want to model this as many tasks, e.g. one page == one task.

Another option would be to handle many pages in one task.  And I'm curious what 
were the factors that led you to split it out more granularly.
