seanmuth commented on PR #67127: URL: https://github.com/apache/airflow/pull/67127#issuecomment-4481473812
Drive-by from #67123 — confirming this matches the fix I proposed there. Both shapes (`@lru_cache` on a thin wrapper keyed on `team_name`) are functionally equivalent; preserves AIP-67 isolation while restoring per-subprocess amortization of the result-backend resolution that pre-3.16.0 was getting from the module-level `app` singleton. Sharing some empirical numbers I gathered while reproducing #67123, in case they're useful for the reviewer: **Per-publish overhead, measured as `create_celery_app(conf); _ = app.backend` (the path inside `[celery] operation_timeout`).** Astro Runtime 13.4.0 (Airflow 2.11.0), 1 vCPU / 2 GiB scheduler, 656 installed distributions, 100 iterations. | Configuration | min | p50 | p95 | p99 | max | iter/s | |---|---:|---:|---:|---:|---:|---:| | Idle, no load | 48ms | 51ms | 67ms | 82ms | 86ms | 18.8 | | Idle + 1.5 GiB anon allocation | 48 | 52 | 89 | 95 | 100 | 17.5 | | 5 modest Dags with mapped tasks running | 52 | 89 | 232 | 363 | 385 | 9.6 | | Same Dag load + 1.0 GiB anon allocation | 51 | 112 | 498 | 698 | 718 | 5.7 | Reference: a real production deployment with the regression (~682 distributions, ~640 active Dags, periodic scheduler OOMs) shows `max=558ms` with consistent `Task Timeout Error for Task` log entries firing at the default 1.0s `operation_timeout`. Downgrading to providers-celery 3.15.x in the same image collapses the per-publish path to microseconds (singleton's backend is already resolved), confirming the regression's scope. The takeaway for sizing the fix's expected impact: with this cache in place, the per-subprocess first-publish cost is paid once per team_name and then amortized over the subprocess's lifetime, restoring the pre-3.16.0 behavior. `apply_async` afterwards is just the broker round-trip plus a cached attribute access. A couple of small things you might want to consider rolling in (entirely optional — happy if you ignore): 1. A docstring on `_get_celery_app_for_workload` linking to the issue and briefly noting *why* the cache exists — future readers wondering about the indirection will appreciate the breadcrumb. Without context, "subprocess-local Celery app cached by team name for task publishing" reads like incidental optimization rather than a regression fix. 2. A third unit test that exercises `_get_celery_app_for_workload` directly (without going through `send_workload_to_executor`) — asserting that consecutive calls for the same team return the same object, and that different team names return distinct objects. The two existing tests cover the end-to-end behavior well; a direct-on-the-cache test would catch refactors that move work out of `_get_celery_app_for_workload` without breaking the integration path. Comparison point in case it's useful: PR #61798 implemented the same AIP-67 multi-team feature for `KubernetesExecutor` without the per-publish cost because that executor's design naturally holds the team-aware `KubeConfig` once at `__init__`. Worth noting that the per-publish reconstruction wasn't an inherent AIP-67 requirement — it was an artifact of the `ProcessPoolExecutor.submit(...)` shape the Celery publisher pool already had. This PR effectively brings the Celery side back in line with that pattern. Either way, the fix is correct and I'd love to see it merged. Thanks for picking this up so quickly. --- Drafted-by: Claude Code (Opus 4.7); reviewed by @seanmuth before posting -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
