Re: [PR] fix(reports): reliability fixes for alert/report execution [superset]

via GitHub Thu, 25 Jun 2026 14:49:05 -0700


aminghadersohi commented on code in PR #41177:
URL: https://github.com/apache/superset/pull/41177#discussion_r3477885585



##########
superset/reports/notifications/webhook.py:
##########
@@ -128,7 +128,21 @@ def _validate_webhook_url(self, url: str) -> None:
             raise NotificationParamException("Webhook URL target host is not 
allowed.")
 
     @backoff.on_exception(
-        backoff.expo, NotificationUnprocessableException, factor=10, base=2, 
max_tries=5
+        backoff.expo,
+        NotificationUnprocessableException,
+        factor=10,
+        base=2,
+        max_tries=5,
+        # Bound total wall-clock retry time. Without this, a hanging or
+        # persistently failing target can stall a worker for minutes per bad 
URL
+        # (up to ~5 socket waits at timeout=60 plus ~150s of retry sleeps),
+        # starving sequential report dispatch. backoff evaluates max_time 
against
+        # the elapsed time sampled BEFORE each attempt, so the last request can
+        # start just under the bound and still run its full timeout: worst case
+        # ~210s (~150s elapsed at the final pre-attempt check + one 60s 
request),

Review Comment:
   NIT: the comment says backoff samples elapsed "BEFORE each attempt" and 
gives "~210s (≈150s elapsed at the final pre-attempt check + one 60s request)". 
`backoff._sync` actually samples elapsed *after* each attempt fails and before 
sleeping — not before the next attempt starts. The arithmetic is also 
internally inconsistent: 150s elapsed > `max_time=120` would already trigger 
giveup without starting any further request. The actual ceiling 
(just-under-120s accumulated + one final 60s request) is closer to ~180s, which 
is consistent with the PR body's own wording. The production behaviour is 
correct; the comment's framing of *when* the check fires and the resulting 
estimate are misleading.



##########
tests/unit_tests/reports/notifications/webhook_tests.py:
##########
@@ -269,3 +271,156 @@ class MockResponse:
 
     with pytest.raises(NotificationParamException, match="redirect"):
         webhook_notification.send()
+
+
+def _make_webhook(mock_header_data) -> WebhookNotification:
+    from superset.reports.models import ReportRecipients, ReportRecipientType
+    from superset.reports.notifications.base import NotificationContent
+
+    content = NotificationContent(
+        name="test alert", header_data=mock_header_data, description="Test 
description"
+    )
+    return WebhookNotification(
+        recipient=ReportRecipients(
+            type=ReportRecipientType.WEBHOOK,
+            recipient_config_json='{"target": "https://example.com/webhook"}',
+        ),
+        content=content,
+    )
+
+
+class _MockServerErrorResponse:
+    status_code = 500
+    text = ""
+
+
+def _allow_internal_app() -> type:
+    class MockCurrentApp:
+        config = {
+            "ALERT_REPORTS_WEBHOOK_HTTPS_ONLY": True,
+            "ALERT_REPORTS_WEBHOOK_ALLOW_INTERNAL_HOSTS": True,
+        }
+
+    return MockCurrentApp
+
+
+def test_send_backoff_bounded_by_max_time(monkeypatch, mock_header_data) -> 
None:
+    """
+    A persistently failing (500) target gives up on wall-time (``max_time``),
+    not just ``max_tries``. ``freeze_time(auto_tick_seconds=30)`` advances the
+    clock on every time access (backoff measures elapsed via ``datetime.now``,
+    which freezegun freezes and ticks), so cumulative elapsed crosses 
``max_time=120``
+    before ``max_tries=5`` is exhausted. We assert the discriminating 
*property*
+    — gave up on wall-time strictly before exhausting ``max_tries`` — rather 
than
+    a pinned count, because freezegun's per-call tick count is an opaque
+    implementation detail. If ``max_time`` is removed this rises to the full 5
+    (RED); the flat-clock companion test below anchors that 5 is the ceiling.
+    The terminal exception type is unchanged on giveup.
+    """
+    webhook_notification = _make_webhook(mock_header_data)
+    post_calls: list[int] = []
+
+    def fake_post(*args, **kwargs) -> _MockServerErrorResponse:
+        post_calls.append(1)
+        return _MockServerErrorResponse()
+
+    monkeypatch.setattr(
+        "superset.reports.notifications.webhook.current_app", 
_allow_internal_app()
+    )
+    monkeypatch.setattr(
+        
"superset.reports.notifications.webhook.feature_flag_manager.is_feature_enabled",
+        lambda flag: True,
+    )
+    monkeypatch.setattr(
+        "superset.reports.notifications.webhook.requests.post", fake_post
+    )
+    monkeypatch.setattr("backoff._sync.time.sleep", lambda *a, **k: None)

Review Comment:
   MEDIUM: all three new tests patch `backoff._sync.time.sleep` (this line, 
372, 416) — a private module attribute of `backoff`, the same coupling class as 
the `_FakeBackoffDatetime` pattern an earlier reviewer flagged (now addressed 
for clock-faking via freezegun). Unlike the old datetime approach, 
`monkeypatch.setattr` here would fail loudly with `AttributeError` if backoff 
reorganises, so it is not a silent-pass risk — but it still binds the tests to 
backoff's internal module layout.
   
   Suggest patching at the stdlib level instead, consistent with the freezegun 
approach now used for clock-faking:
   ```suggestion
       monkeypatch.setattr(time.sleep, lambda *a, **kw: None)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] fix(reports): reliability fixes for alert/report execution [superset]

Reply via email to