diogosilva30 opened a new issue, #68077:
URL: https://github.com/apache/airflow/issues/68077

   ### Apache Airflow version
   
   3.2.2 (also affects `main`).
   
   ### What happened
   
   After upgrading to Airflow 3.2.x with the `edge3` provider (3.7.0), all 
`airflow_edge_worker_*` metrics disappeared from the metrics backend 
(StatsD/Prometheus). Other Airflow metrics (`scheduler.*`, `dag_processing.*`, 
`api_server.*`, pool metrics) kept working.
   
   ### What you think should happen instead
   
   Edge Worker metrics (`edge_worker.connected`, `edge_worker.num_queues`, 
`edge_worker.heartbeat_count`, `edge_worker.ti.*`, etc.) should be emitted as 
before.
   
   ### Root cause
   
   The Edge Worker REST API is served by the **API server** 
(`/edge_worker/v1/...`). A worker heartbeat (`PATCH 
/edge_worker/v1/worker/<name>`) runs `set_state` → `set_metrics`, which records 
metrics through the **Task SDK** `Stats` singleton 
(`airflow.sdk._shared.observability.metrics.stats`, resolved by the Edge 
provider via `airflow.providers.common.compat`).
   
   Every other long-running component initializes that singleton:
   
   - core stats: `jobs/scheduler_job_runner.py`, 
`jobs/triggerer_job_runner.py`, `dag_processing/manager.py`, 
`executors/base_executor.py`
   - SDK stats: `task-sdk/.../execution_time/task_runner.py`, 
`task-sdk/.../serde/__init__.py`
   
   …all call `stats.initialize(factory=stats_utils.get_stats_factory(), 
export_legacy_names=...)`.
   
   The **API server never calls `stats.initialize(...)`**. Before #63932 
(*Remove the DualStatsManager and the Stats interfaces*), `Stats` lazily 
auto-initialized its backend on first use. #63932 replaced that with explicit 
`Stats.initialize(...)` + a PID guard, and added the explicit call to the 
components above — but **not** to the API server. As a result the Task SDK 
`Stats` singleton in the API server process stays a `NoStatsLogger`, and every 
Edge Worker metric is silently dropped.
   
   This also explains the asymmetry that `api_server.*` metrics still work: 
they go through the separately-initialized **core** stats path, while the Edge 
metrics use the uninitialized **SDK** path.
   
   ### Minimal reproduction
   
   1. Airflow 3.2.x, `EdgeExecutor`, `edge3` 3.7.0, metrics enabled (`[metrics] 
statsd_on = True`).
   2. Start an edge worker so it heartbeats against the API server.
   3. Scrape the metrics backend → no `edge_worker.*` series.
   4. In the API server process: `from airflow.providers.common.compat.sdk 
import Stats; type(Stats.instance)` → `NoStatsLogger`.
   5. Manually run, in the same process:
      ```python
      from airflow.sdk._shared.observability.metrics import stats
      from airflow.sdk.observability.metrics import stats_utils
      from airflow.configuration import conf
      stats.initialize(
          factory=stats_utils.get_stats_factory(),
          export_legacy_names=conf.getboolean("metrics", "legacy_names_on"),
      )
      ```
      → on the next heartbeat all `edge_worker.*` series appear (verified in a 
live 3.2.2 deployment: 200+ series restored, correctly tagged with 
`worker_name`).
   
   ### Proposed fix
   
   Initialize the Task SDK `Stats` singleton in the API server's FastAPI 
`lifespan` (runs once per worker, post-fork), mirroring the existing init in 
`serde` / `task_runner`. PR to follow.
   
   ### Relationship to #67328
   
   #67328 (*Bring back edge worker metric compatibility with Airflow 3.2*) is 
**complementary, not a duplicate**: it makes `edge3` dual-emit the legacy 
dotted form so old StatsD mappings match again (a tag/naming concern). It does 
**not** initialize the `Stats` singleton — empirically, 
`DualStatsManager.gauge(...)` without `Stats.initialize()` is also dropped. 
This issue is the root-cause init gap.
   
   ### Side note (separate, benign)
   
   Once Stats is initialized, the SDK `statsd_logger` logs `Dropping invalid 
tag: queues=a,b,c` for the `edge_worker.num_queues` gauge: `set_metrics` builds 
a `queues` tag whose value is a comma-joined list, and commas are the InfluxDB 
tag delimiter, so that single tag is dropped (the metric value and 
`worker_name` still flow). Worth fixing in `edge3` separately.
   
   ### Operating System / Deployment
   
   Kubernetes, Airflow 3.2.2, `edge3` provider 3.7.0, `statsd_exporter` v0.28.0.
   
   ### Are you willing to submit PR?
   
   Yes — PR to follow.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to