The GitHub Actions job "Tests (AMD)" on airflow.git/v3-2-test has succeeded.
Run started by GitHub user vatsrahul1001 (triggered by vatsrahul1001).

Head commit for run:
0dda7d44c8c9949280c9a80a2d2b41c4ccbcba3e / Rahul Vats <[email protected]>
Add configurable LRU+TTL caching for API server DAG retrieval (#60804) (#66862)

Fixes memory growth in long-running API servers by adding bounded LRU+TTL caching to `DBDagBag`. Previously, the internal dict cache never expired and never evicted, causing memory to grow indefinitely as DAG versions accumulated (~500 MB/day with 100+ DAGs updating daily).

Two new `[api]` config options control caching:

| Config | Default | Description |
|--------|---------|-------------|
| `dag_cache_size` | `64` | Max cached DAG versions (0 = unbounded dict, no eviction) |
| `dag_cache_ttl` | `3600` | TTL in seconds (0 = LRU only, no time-based expiry) |

**API server only.** The scheduler continues using a plain unbounded dict with zero lock overhead (`nullcontext` instead of `RLock`). The bounded cache + lock is only created when `cache_size > 0`.

**Cache thrashing prevention.** `iter_all_latest_version_dags()` (used by the DAG listing endpoint) bypasses the cache entirely. Without this, every DAG listing request would flush the hot working set and replace it with a full scan of all DAGs.

**Double-checked locking.** When multiple threads miss on the same `version_id` concurrently, only the first thread queries the DB. The rest find it cached after acquiring the lock. Metrics are emitted correctly: a single lookup never counts as both a hit and a miss.

**Separate model cache.** `get_serialized_dag_model()` maintains its own dict cache. The triggerer needs the full `SerializedDagModel` (for `.data`), not the deserialized `SerializedDAG` stored in the LRU/TTL cache.

**Cache keying.** The cache is keyed by DAG version ID. Lookups by `dag_id` (e.g., viewing a DAG's details) always query the DB for the latest version, but the deserialized result is cached for subsequent version-specific lookups (e.g., task instance views for a specific DAG run).

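The bounded cache, the `nullcontext`/`RLock` split, and the double-checked lookup described above can be sketched with the standard library alone. This is a minimal illustration, not the code from the commit: `VersionCache`, `loader`, `hits`, and `misses` are hypothetical names, a plain `OrderedDict` stands in for the bounded LRU/TTL cache, and applying the TTL only in bounded mode is an assumption.

```python
import threading
import time
from collections import OrderedDict
from contextlib import nullcontext

_MISSING = object()  # sentinel so a cached falsy value still counts as a hit


class VersionCache:
    """Illustrative cache keyed by version ID; not Airflow's DBDagBag."""

    def __init__(self, loader, cache_size=64, cache_ttl=3600, clock=time.monotonic):
        self._loader = loader       # hypothetical callable standing in for the DB query
        self._size = cache_size     # 0 = unbounded dict, no eviction
        self._ttl = cache_ttl       # 0 = LRU only, no time-based expiry
        self._clock = clock
        self._data = OrderedDict()  # version_id -> (stored_at, value)
        # Only the bounded cache pays for a real lock.
        self._lock = threading.RLock() if cache_size > 0 else nullcontext()
        self.hits = 0
        self.misses = 0

    def _lookup(self, version_id):
        entry = self._data.get(version_id, _MISSING)
        if entry is _MISSING:
            return _MISSING
        stored_at, value = entry
        bounded = self._size > 0
        if bounded and self._ttl > 0 and self._clock() - stored_at >= self._ttl:
            del self._data[version_id]  # entry has expired
            return _MISSING
        if bounded:
            self._data.move_to_end(version_id)  # mark as most recently used
        return value

    def get(self, version_id):
        # First check: short critical section, no loading.
        with self._lock:
            value = self._lookup(version_id)
            if value is not _MISSING:
                self.hits += 1
                return value
        # Second check: re-test under the lock, then load at most once.
        with self._lock:
            value = self._lookup(version_id)
            if value is not _MISSING:
                self.hits += 1  # another thread filled the entry first
                return value
            self.misses += 1    # confirmed miss
            value = self._loader(version_id)
            self._data[version_id] = (self._clock(), value)
            while self._size > 0 and len(self._data) > self._size:
                self._data.popitem(last=False)  # evict least recently used
            return value
```

Each call to `get()` increments exactly one of `hits` or `misses`, which mirrors the "never counts as both" property stated above. The first check still takes the lock because, in a bounded cache, even a read reorders entries.
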
**Staleness.** After a DAG is updated, the API server may serve the previous version until the cached entry expires (controlled by `dag_cache_ttl`). This is documented in the config description.

**Why `cachetools`.** `cachetools` is a small, pure-Python library (~1K LOC) already present as a transitive dependency via `google-auth`. It provides battle-tested `LRUCache` and `TTLCache` implementations. Pinned at `>=6.0.0` to match the FAB provider.

**Why `RLock`.** `cachetools` caches are NOT thread-safe -- `.get()` mutates internal doubly-linked lists (LRU reordering) and TTL access triggers cleanup. Without synchronization, concurrent access can corrupt the data structure.

| Metric | Type | Description |
|--------|------|-------------|
| `api_server.dag_bag.cache_hit` | Counter | Cache hits (including double-checked locking hits) |
| `api_server.dag_bag.cache_miss` | Counter | Confirmed misses (after double-check) |
| `api_server.dag_bag.cache_clear` | Counter | Cache clears |
| `api_server.dag_bag.cache_size` | Gauge | Current cache size (sampled at 10%) |

- Default behavior unchanged for scheduler and triggerer (unbounded dict, no lock)
- API server gets caching by default (`dag_cache_size=64`, `dag_cache_ttl=3600`)
- Use `dag_cache_size=0` to restore pre-change behavior (unbounded dict)
- No breaking changes to public APIs; `get_serialized_dag_model()` and `get_dag()` signatures preserved

- #64326 (closed) -- similar fix with OrderedDict-based LRU, no TTL
- #60940 (merged) -- gunicorn support with rolling worker restarts (complementary, handles memory growth from any source)

(cherry picked from commit 26cbdcbe948c105322fee64064b24697f03b9dc1)

Co-authored-by: Kaxil Naik <[email protected]>

Report URL: https://github.com/apache/airflow/actions/runs/25920089914

With regards,
GitHub Actions via GitBox
