shahar1 opened a new pull request, #67747: URL: https://github.com/apache/airflow/pull/67747
## Summary

Reduce DAG file-queue de-duplication cost from O(N²) to O(N) by replacing `collections.deque` with `OrderedDict[DagFileInfo, None]`. Membership testing (`in`), push-front (`move_to_end(key, last=False)`), and pop-front (`popitem(last=False)`) are all O(1), eliminating the quadratic cost in the `frontprio` and re-add paths.

## Impact

**Benchmark results** (best-of-N, ms):

| path | files | before | after | speedup |
|---|---|---|---|---|
| frontprio re-add | 4000 | 2320.7 | 3.82 | ~610× |
| front re-add | 4000 | 2299.7 | 2.94 | ~780× |
| incremental drip | 4000 | 2292.7 | 8.11 | ~280× |

The normalized `ms/N²` column flips from ~142 (quadratic) to ~0.1 (linear), confirming the fix.

## Implementation

- Swap `_file_queue: deque[DagFileInfo]` → `OrderedDict[DagFileInfo, None]` (ordered set)
- Update `_add_files_to_queue`, `_start_new_processes`, `purge_removed_files_from_queue`, `_resort_file_queue`
- Migrate tests from `deque(...)` to `OrderedDict.fromkeys(...)`
- Remove unused `import contextlib`

Behavior is byte-identical to the old deque (verified over 300 random ops). Bonus: OrderedDict also de-dups within incoming batches.

## Testing

- All 116 `test_manager.py` tests pass
- Benchmarks in `dev/dag_processing_benchmarks/` for regression tracking

---

##### Was generative AI tooling used to co-author this PR?

- [X] Yes (Claude Haiku 4.5)

Generated-by: Claude Haiku 4.5 following [the guidelines](https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#gen-ai-assisted-contributions)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
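For readers unfamiliar with the ordered-set idiom named in the Summary, here is a minimal sketch of the technique. It is not the code from this PR: the function names are hypothetical, plain strings stand in for `DagFileInfo`, and the exact front re-add semantics (duplicates move to the front, batch order preserved) are an assumption made for illustration.

```python
from collections import OrderedDict, deque


def add_to_front_deque(queue: deque, items: list) -> None:
    """deque version: `in` and `remove` each scan the queue, so every item costs O(N)."""
    for item in reversed(items):
        if item in queue:       # linear scan
            queue.remove(item)  # linear removal
        queue.appendleft(item)


def add_to_front_ordered(queue: OrderedDict, items: list) -> None:
    """OrderedDict version: both operations are O(1) per item."""
    for item in reversed(items):
        queue[item] = None                   # inserts if missing; keeps position if present
        queue.move_to_end(item, last=False)  # move the key to the front


def pop_front(queue: OrderedDict):
    """Remove and return the key at the front of the queue."""
    key, _ = queue.popitem(last=False)
    return key


# Hypothetical file names; both variants should produce the same ordering.
ordered = OrderedDict.fromkeys(["a.py", "b.py", "c.py"])
legacy = deque(["a.py", "b.py", "c.py"])
add_to_front_ordered(ordered, ["c.py", "d.py"])
add_to_front_deque(legacy, ["c.py", "d.py"])
assert list(ordered) == list(legacy)
```

Storing `None` as every value makes the dictionary behave as an insertion-ordered set, which is also why duplicates inside a single incoming batch collapse to one entry.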
