xvega commented on PR #62369:
URL: https://github.com/apache/airflow/pull/62369#issuecomment-3977225156

   > Please make sure to test with many very large dag runs - we intentionally 
moved from a single call to 1 per in #51805. There is a dag in there you can 
use.
   
   @jedcunningham Thanks for the heads up! I'm aware of #51805, and this implementation intentionally avoids re-introducing that problem.

   The PR actually evolved from an initial batch approach (a single `WHERE run_id IN (all_ids)` query) to the current streaming one, precisely because I noticed the batch approach was slow on large DAGs — the exact problem #51805 fixed. So I switched to per-run queries inside a streaming generator instead.

   The key difference from the original N+1: the streaming endpoint issues one `WHERE run_id = X` query per run inside a generator, the same query pattern as #51805. Each run's data is processed and yielded before the next run is fetched, so memory usage stays bounded to one run at a time.
   
   What I changed is only the HTTP layer: instead of N separate browser 
requests, I open one connection and stream NDJSON lines back as each run is 
ready. Same DB footprint as #51805, but without the N+1 round-trips.
   
   Tested locally (6050 tasks/run, 25 runs = ~151k TIs):
     - Streaming: 7.3s total, first column visible at 0.6s, 1 HTTP connection
     - Per-run parallel (browser, 6 concurrent): 7.8s total, first column at 1.5s
     - Per-run sequential: 8.3s total
     
     
   Looking at the numbers:
     - Total time: basically the same (~7-8s); the DB work is identical
     - Real gain: time to first visible column, 0.6s vs 1.5s (2.5× faster perceived load)
     - Connection count: 1 vs 25, so no browser connection-limit bottleneck
     - Main benefit at scale: with 25+ runs, browsers cap concurrent requests at 6, so the remaining 19 runs queue up. The stream has no such limit; all runs process server-side sequentially and arrive as fast as the DB can serve them.
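   On the client side, consuming NDJSON incrementally is what makes the progressive rendering work. A small hedged sketch (stdlib only; `io.StringIO` stands in for the single connection's response body):

   ```python
   import io
   import json
   from typing import IO, Iterator


   def iter_ndjson(stream: IO[str]) -> Iterator[dict]:
       # Each complete line is one record, so the consumer can act on a run's
       # data as soon as its line arrives, without waiting for the rest.
       for line in stream:
           line = line.strip()
           if line:
               yield json.loads(line)


   # Simulate one connection's response body arriving as NDJSON.
   body = io.StringIO('{"run_id": "r1"}\n{"run_id": "r2"}\n')
   runs = list(iter_ndjson(body))  # two dicts, one per run
   ```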
   
   The gain isn't raw throughput; it's perceived responsiveness (columns appear progressively) and the elimination of connection overhead at scale. On a production instance with high per-request latency the difference is much more pronounced: 10 runs × 1.5s = 15s of cumulative latency vs a single stream.
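   That arithmetic can be written as a back-of-envelope model (purely illustrative, not a benchmark; it counts only per-request round-trip latency and ignores server/DB time entirely):

   ```python
   import math


   def cumulative_latency_s(num_runs: int, rtt_s: float, concurrency: int) -> float:
       # Toy model: requests are issued in waves of `concurrency`,
       # and each wave pays one round trip.
       waves = math.ceil(num_runs / concurrency)
       return waves * rtt_s


   # 10 runs, 1.5s round trip, strictly sequential: 10 waves -> 15.0s
   sequential = cumulative_latency_s(10, 1.5, concurrency=1)
   # Browser cap of 6 concurrent connections, 25 runs: 5 waves -> 7.5s
   browser_parallel = cumulative_latency_s(25, 1.5, concurrency=6)
   # A single stream pays the round trip once, then lines arrive continuously.
   streamed = cumulative_latency_s(10, 1.5, concurrency=10)  # 1 wave -> 1.5s
   ```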


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
