ferruzzi commented on issue #68294:
URL: https://github.com/apache/airflow/issues/68294#issuecomment-4725975650

   For posterity: to check whether I got them all, I listed the metrics covered
   by these 7 PRs, then pointed Claude Opus 4.8 at the codebase with the
   following prompt:
   
   ```
   A metric can be tied to a team if:
   - one of the following is available at the emit site: dag_id, bundle_name, 
pool, team-scoped executor, team-scoped triggerer, connection-test row, 
TaskInstance / RuntimeTaskInstance, or  DagRun context 
   - AND it is NOT a shared-infrastructure or cross-team-aggregate metric 
because those cannot be attributed to any single team even when a team identity 
is technically nearby.
   
   Metrics which do not have emit sites, metrics which only exist in providers, 
and metrics which can not be tied to a team are considered out of scope for 
this discussion. 
   
   Metrics which are included in the PR list below are considered done and 
should not be included in the list of remaining work. 
   
   1. Read the list of metrics from the metrics registry at 
`shared/observability/src/airflow_shared/observability/metrics/metrics_template.yaml`
   2. For each metric in the registry, locate and read its emit site and 
determine if it is in scope
   3. Generate a report which contains three lists: 
     - 1) Done: Metrics which are included in the PR list below, or whose 
emitted tags include team_name either directly or via an indirect source such 
as stats_tags/metric_tags
     - 2) Remaining work: Metrics which can be linked to a Team and do not 
include the team_name tag.
     - 3) Out of Scope: Metrics which are out of scope, sorted by the reason 
(legacy, provider, or non-team)
   4. Finally, cross-check those lists to ensure that every metric you found 
exists on one and only one of those lists.
   
   Related PRs:
   - 68108
   - 68367
   - 68589
   - 68593
   - 68594
   - 68599
   - 68601
   ``` 
   
   and this was the generated report:
   
   
   # Multi-Team Metrics: Final Count
   
   Categorization of all 125 metrics in the registry (`metrics_template.yaml`), 
based on the 7 related PRs and a trace of every emit site.
   
   ## How the 7 PRs map
   
   The PRs explicitly tag 47 metrics. A further 9 are already covered indirectly
   because they emit through `DagRun.stats_tags` / `TaskInstance.stats_tags` /
   `RuntimeTaskInstance.stats_tags`, all of which were given `team_name` in 68108.
   So 56 are Done.
   
   ## 1) Done (56)
   
   Directly tagged by a listed PR:
   
   - 68108 (DagRun/TI stats_tags + pools + SDK): `pool.open_slots`, 
`pool.queued_slots`, `pool.running_slots`, `pool.deferred_slots`, 
`pool.scheduled_slots`, `pool.starving_tasks`, `ti.start`, 
`operator_successes`, `operator_failures`, `ti_successes`, `ti_failures`, 
`task.duration`
   - 68367 (assets): `asset.updates`, `asset.triggered_dagruns`
   - 68589 (deadlines): `deadline_alerts.deadline_created`, 
`deadline_alerts.deadline_missed`, `deadline_alerts.deadline_not_missed`
   - 68593 (executors): `executor.open_slots`, `executor.queued_tasks`, 
`executor.running_tasks`
   - 68594 (scheduler): `scheduler.tasks.killed_externally`, 
`dagrun.schedule_delay`, `dagrun.duration.failed`, `ti.scheduled`, `ti.queued`, 
`ti.running`, `ti.deferred`, `task_instances_without_heartbeats_killed`
   - 68599 (dag processor): `dag_processing.other_callback_count`, 
`dag_processing.last_run.seconds_ago`, `dag_processing.processes`, 
`dag_processing.processor_timeouts`, `dag_processing.callback_only_count`, 
`dag_processing.last_duration`, `dag.callback_exceptions`
   - 68601 (stragglers): `triggerer_heartbeat`, `triggers.succeeded`, 
`triggers.failed`, `triggers.running`, `triggerer.capacity_left`, 
`task.scheduled_duration`, `task.queued_duration`, 
`resumable_job.fresh_submit`, `resumable_job.already_succeeded`, 
`resumable_job.terminal_resubmit`, `resumable_job.reconnect_attempt`, 
`resumable_job.reconnect_success`
   
   Done indirectly (emit through a stats_tags source that now carries 
`team_name`):
   
   - `ti.finish` (task-sdk `run()`, shares the `ti.stats_tags` variable)
   - `previously_succeeded` (`taskinstance.py`, `ti.stats_tags`)
   - `task_removed_from_dag`, `task_restored_to_dag`, `task_instance_created` 
(`dagrun.py`, `self.stats_tags`)
   - `dagrun.dependency-check`, `dagrun.first_task_scheduling_delay`, 
`dagrun.first_task_start_delay`, `dagrun.duration.success` (`dagrun.py`, 
`self.stats_tags`)
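
The indirect case can be illustrated with a minimal sketch. It assumes that `stats_tags` is a property returning a tag dict and that emit sites pass that dict along unchanged. `FakeDagRun`, `incr`, and `emitted` are hypothetical stand-ins written for this example; they are not Airflow's classes or its stats client.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class FakeDagRun:
    """Hypothetical stand-in for a DagRun-like object (not Airflow's class)."""

    dag_id: str
    run_type: str
    team_name: str | None = None

    @property
    def stats_tags(self) -> dict[str, str]:
        # Single source of tags: adding team_name here reaches every
        # metric that emits with these tags.
        tags = {"dag_id": self.dag_id, "run_type": self.run_type}
        if self.team_name is not None:
            tags["team_name"] = self.team_name
        return tags


# Hypothetical recorder standing in for a stats client.
emitted: list[tuple[str, dict[str, str]]] = []


def incr(metric: str, tags: dict[str, str]) -> None:
    emitted.append((metric, dict(tags)))


run = FakeDagRun(dag_id="example", run_type="manual", team_name="team_a")

# Neither emit site mentions team_name, yet both carry it.
incr("task_instance_created", run.stats_tags)
incr("dagrun.duration.success", run.stats_tags)
```

Under that assumption, a metric needs no per-site change once the shared tag source carries `team_name`, which is why these 9 count as Done without appearing in a PR diff.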
   
   ## 2) Remaining work (6)
   
   Team-attributable, emit site has a team identity, but no `team_name` tag yet:
   
   - `connection_test.reaped` (scheduler reaper; `ct.team_name` is literally in 
scope and logged one line above the `stats.incr`, only `prior_state` is tagged)
   - `connection_test.success` (worker connection-test supervisor; per 
connection-test row)
   - `connection_test.failed` (same)
   - `connection_test.hook_duration` (same)
   - `scheduler.executor_heartbeat_duration` (loops over `self.executors`; each 
`executor.team_name` is available, tagged only by executor class)
   - `triggers.blocked_main_thread` (emitted in a team-scoped triggerer's 
`TriggerRunner`; `team_name` lives on the runner/supervisor but is not plumbed 
to this emit)
   
   Two of these are more borderline than the others:
   `scheduler.executor_heartbeat_duration` and `triggers.blocked_main_thread` are
   component-level (team-scoped executor / team-scoped triggerer). They satisfy
   the signal list, but if those components are considered "shared," they'd drop
   to non-team. The four `connection_test.*` rows are clean Remaining items per
   the explicit "connection-test row" signal.
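
The scope rule from the prompt, and why the two borderline metrics can flip, can be sketched as a small predicate. The signal labels and the `shared_or_aggregate` flag are names invented for this illustration. Deciding whether a component is "shared" remains a human judgment that the code only takes as input.

```python
# Signals from the prompt; the string labels are hypothetical.
TEAM_SIGNALS = frozenset(
    {
        "dag_id",
        "bundle_name",
        "pool",
        "team_scoped_executor",
        "team_scoped_triggerer",
        "connection_test_row",
        "task_instance",
        "dag_run",
    }
)


def is_team_attributable(signals_at_emit_site: set, shared_or_aggregate: bool) -> bool:
    """True when a team signal is present AND the metric is not shared/aggregate."""
    has_signal = bool(TEAM_SIGNALS & set(signals_at_emit_site))
    return has_signal and not shared_or_aggregate


# connection_test.reaped: a per-row signal, not an aggregate.
reaped = is_team_attributable({"connection_test_row"}, shared_or_aggregate=False)

# scheduler.executor_heartbeat_duration: the answer depends on how the
# executor component is classified.
heartbeat_team_scoped = is_team_attributable({"team_scoped_executor"}, False)
heartbeat_if_shared = is_team_attributable({"team_scoped_executor"}, True)
```

Here `reaped` and `heartbeat_team_scoped` are `True`, while `heartbeat_if_shared` is `False`: the same emit site moves from Remaining to Non-team purely on the "shared" classification.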
   
   ## 3) Out of Scope
   
   Sorted by reason.
   
   ### Legacy / no emit site (6)
   
   - `local_task_job.task_exit` (only in the validator regex; no emit)
   - `dag_file_processor_timeouts` (marked DEPRECATED; no emit)
   - `dag_processing.manager_stalls` (no emit)
   - `dag_file_refresh_error` (no emit)
   - `dag_processing.last_num_of_db_queries.{dag_file}` (stored on 
`DagFileStat`, never emitted)
   - `collect_db_dags` (no emit anywhere)
   
   ### Provider-only (28)
   
   - celery: `celery.task_timeout_error`, `celery.execute_command.failure`
   - openlineage: `ol.emit.failed`, `ol.event.size`, `ol.emit.attempts`, 
`ol.extract`
   - edge3 worker: `edge_worker.status`, `edge_worker.connected`, 
`edge_worker.maintenance`, `edge_worker.jobs_active`, 
`edge_worker.concurrency`, `edge_worker.free_concurrency`, 
`edge_worker.num_queues`, `edge_worker.heartbeat_count`, 
`edge_worker.ti.start`, `edge_worker.ti.finish`
   - edge3 executor: `edge_executor.sync.duration`
   - cncf.kubernetes: `kubernetes_executor.pod_creation_status`, 
`kubernetes_executor.pod_deletion_status`, 
`kubernetes_executor.pod_patching_status`, 
`kubernetes_executor.clear_not_launched_queued_tasks.duration`, 
`kubernetes_executor.adopt_task_instances.duration`, 
`kubernetes_executor.pod_creation`, `kubernetes_executor.pod_deletion`, 
`kubernetes_executor.pod_patching`
   - amazon: `batch_executor.adopt_task_instances.duration`, 
`ecs_executor.adopt_task_instances.duration`, 
`lambda_executor.adopt_task_instances.duration`
   
   ### Non-team (shared infrastructure or cross-team aggregate) (29)
   
   - Job/component lifecycle: `{job_name}_start`, `{job_name}_end`, 
`{job_name}_heartbeat_failure`, `scheduler_heartbeat`, `dag_processor_heartbeat`
   - Scheduler aggregates/infra: `scheduler.orphaned_tasks.cleared`, 
`scheduler.orphaned_tasks.adopted`, `scheduler.critical_section_busy`, 
`scheduler.tasks.starving`, `scheduler.tasks.executable`, 
`scheduler.dagruns.running`, `scheduler.critical_section_duration`, 
`scheduler.critical_section_query_duration`, `scheduler.scheduler_loop_duration`
   - Dag-processing aggregates: `dag_processing.file_path_queue_update_count`, 
`dag_processing.import_errors`, `dag_processing.total_parse_time`, 
`dag_processing.file_path_queue_size`, `dagbag_size`
   - Assets aggregate: `asset.orphaned`
   - API-server shared cache: `api_server.dag_bag.cache_hit`, 
`api_server.dag_bag.cache_miss`, `api_server.dag_bag.cache_clear`, 
`api_server.dag_bag.cache_size`
   - Connection-test aggregates: `connection_test.active`, 
`connection_test.pending`, `connection_test.dispatch_duration`
   - SDK startup infra: `airflow.io.load_filesystems`, `serde.load_serializers`
   
   ## 4) Cross-check
   
   - Done 56 + Remaining 6 + Legacy 6 + Provider 28 + Non-team 29 = 125, which 
equals the registry count.
   - Each metric appears on exactly one list; no metric is on two lists, and 
none is unaccounted for.
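
For anyone re-running this cross-check, a minimal sketch follows. The counts come from the report above. `check_partition` is a hypothetical helper, and the three-metric registry is a tiny illustrative subset, not the real contents of `metrics_template.yaml`.

```python
from collections import Counter

# Bucket sizes as stated in the report.
counts = {"done": 56, "remaining": 6, "legacy": 6, "provider": 28, "non_team": 29}
total = sum(counts.values())


def check_partition(registry, buckets):
    """Return (duplicated, missing, unknown) metric names.

    duplicated: names that appear in more than one bucket
    missing:    registry names that appear in no bucket
    unknown:    bucket names that are not in the registry
    """
    seen = Counter(name for names in buckets.values() for name in names)
    duplicated = {name for name, n in seen.items() if n > 1}
    missing = set(registry) - set(seen)
    unknown = set(seen) - set(registry)
    return duplicated, missing, unknown


# Illustrative subset only.
sample_registry = {"ti.start", "connection_test.reaped", "collect_db_dags"}
sample_buckets = {
    "done": ["ti.start"],
    "remaining": ["connection_test.reaped"],
    "legacy": ["collect_db_dags"],
}
result = check_partition(sample_registry, sample_buckets)
```

A clean partition returns three empty sets. Matching totals alone would not catch a metric listed twice alongside another one that was dropped, so the per-name check is the stronger of the two.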
   

