seanmuth commented on PR #67127:
URL: https://github.com/apache/airflow/pull/67127#issuecomment-4481473812

   Drive-by from #67123 — confirming this matches the fix I proposed there. 
Both shapes (`@lru_cache` on a thin wrapper keyed on `team_name`) are 
functionally equivalent; preserves AIP-67 isolation while restoring 
per-subprocess amortization of the result-backend resolution that pre-3.16.0 
was getting from the module-level `app` singleton.
   
   Sharing some empirical numbers I gathered while reproducing #67123, in case 
they're useful for the reviewer:
   
   **Per-publish overhead, measured as `create_celery_app(conf); _ = 
app.backend` (the path inside `[celery] operation_timeout`).** Astro Runtime 
13.4.0 (Airflow 2.11.0), 1 vCPU / 2 GiB scheduler, 656 installed distributions, 
100 iterations.
   
   | Configuration | min | p50 | p95 | p99 | max | iter/s |
   |---|---:|---:|---:|---:|---:|---:|
   | Idle, no load | 48ms | 51ms | 67ms | 82ms | 86ms | 18.8 |
   | Idle + 1.5 GiB anon allocation | 48 | 52 | 89 | 95 | 100 | 17.5 |
   | 5 modest Dags with mapped tasks running | 52 | 89 | 232 | 363 | 385 | 9.6 |
   | Same Dag load + 1.0 GiB anon allocation | 51 | 112 | 498 | 698 | 718 | 5.7 
|
   
   Reference: a real production deployment with the regression (~682 
distributions, ~640 active Dags, periodic scheduler OOMs) shows `max=558ms` 
with consistent `Task Timeout Error for Task` log entries firing at the default 
1.0s `operation_timeout`. Downgrading to providers-celery 3.15.x in the same 
image collapses the per-publish path to microseconds (singleton's backend is 
already resolved), confirming the regression's scope.
   
   The takeaway for sizing the fix's expected impact: with this cache in place, 
the per-subprocess first-publish cost is paid once per team_name and then 
amortized over the subprocess's lifetime, restoring the pre-3.16.0 behavior. 
`apply_async` afterwards is just the broker round-trip plus a cached attribute 
access.
   
   A couple of small things you might want to consider rolling in (entirely 
optional — happy if you ignore):
   
   1. A docstring on `_get_celery_app_for_workload` linking to the issue and 
briefly noting *why* the cache exists — future readers wondering about the 
indirection will appreciate the breadcrumb. Without context, "subprocess-local 
Celery app cached by team name for task publishing" reads like incidental 
optimization rather than a regression fix.
   
   2. A third unit test that exercises `_get_celery_app_for_workload` directly 
(without going through `send_workload_to_executor`) — asserting that 
consecutive calls for the same team return the same object, and that different 
team names return distinct objects. The two existing tests cover the end-to-end 
behavior well; a direct-on-the-cache test would catch refactors that move work 
out of `_get_celery_app_for_workload` without breaking the integration path.
   
   Comparison point in case it's useful: PR #61798 implemented the same AIP-67 
multi-team feature for `KubernetesExecutor` without the per-publish cost 
because that executor's design naturally holds the team-aware `KubeConfig` once 
at `__init__`. Worth noting that the per-publish reconstruction wasn't an 
inherent AIP-67 requirement — it was an artifact of the 
`ProcessPoolExecutor.submit(...)` shape the Celery publisher pool already had. 
This PR effectively brings the Celery side back in line with that pattern.
   
   Either way, the fix is correct and I'd love to see it merged. Thanks for 
picking this up so quickly.
   
   ---
   Drafted-by: Claude Code (Opus 4.7); reviewed by @seanmuth before posting
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to