henry3260 commented on PR #67549: URL: https://github.com/apache/airflow/pull/67549#issuecomment-4553244404
> Thanks for the review @henry3260! > > Re: _misleading percentages when capped_ — The per-state counts are still capped at 1000, and the frontend already shows "N+" when a count reaches that cap (see `state_count_limit` in the response). The uncapped total just provides a better denominator so that non-capped states get accurate percentages. Capped states will show an understated percentage, but the "1000+" indicator signals the uncertainty. This is strictly better than the current behavior where all percentages are wrong (even for non-capped states) because the denominator itself is wrong. > > Re: _expensive COUNT(_) queries* — Both `dag_run_total_count` and `task_instance_total_count` are simple `COUNT(*)` queries that hit the same indexed filters (`dag_id`, `start_date`, `end_date`) already used by the capped counting subqueries. In practice these are index-only scans that execute in single-digit milliseconds, adding negligible overhead to the overall endpoint. > > An alternative would be to remove the caps entirely, but that was intentionally avoided to keep the endpoint fast for deployments with millions of task instances. The current approach is a minimal, pragmatic improvement. IMO, we cannot assume these queries are cheap just because they are simple `COUNT(*)` queries. Am I missing something, or do we not have a composite index that matches this filter shape, such as `(dag_id, start_date, end_date)`? My concern is that the existing capped queries are bounded by `STATE_COUNT_CAP`, while these new total count queries need to count every matching Dag run / task instance. For example, if there are 100000 matching Dag runs in the selected time range, and each Dag run has 50 task instances, then `ti_total` may need to count 5000000 task instance rows. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
