Re: [PR] fix(reports): reliability fixes for alert/report execution [superset]

via GitHub Fri, 26 Jun 2026 10:12:41 -0700


sadpandajoe commented on code in PR #41177:
URL: https://github.com/apache/superset/pull/41177#discussion_r3482965288



##########
superset/reports/notifications/webhook.py:
##########
@@ -128,7 +128,21 @@ def _validate_webhook_url(self, url: str) -> None:
             raise NotificationParamException("Webhook URL target host is not 
allowed.")
 
     @backoff.on_exception(
-        backoff.expo, NotificationUnprocessableException, factor=10, base=2, 
max_tries=5
+        backoff.expo,
+        NotificationUnprocessableException,
+        factor=10,
+        base=2,
+        max_tries=5,
+        # Bound total wall-clock retry time. Without this, a hanging or
+        # persistently failing target can stall a worker for minutes per bad 
URL
+        # (up to ~5 socket waits at timeout=60 plus ~150s of retry sleeps),
+        # starving sequential report dispatch. backoff evaluates max_time 
against
+        # the elapsed time sampled BEFORE each attempt, so the last request can
+        # start just under the bound and still run its full timeout: worst case
+        # ~210s (~150s elapsed at the final pre-attempt check + one 60s 
request),

Review Comment:
   Good catch — reworded the comment. You're right that the original framing of 
*when* the check fires was misleading.
   
   To reconcile the number, though: in `backoff._sync.retry_exception`, 
`elapsed` is sampled at the *top* of each loop iteration (before the attempt), 
and the `max_time` check runs in the `except` block against that same 
pre-attempt value. So:
   
   - after attempt 2 fails, the check sees `elapsed` ≈ 60–70s (< 120) → it 
proceeds to attempt 3;
   - attempt 3 starts at ~120–150s and runs its full 60s timeout; only *after* 
it fails does the check (`elapsed` ≥ 120) stop the loop.
   
   That's 3 full-timeout attempts plus the two jitter sleeps: total = 180 + s1 
+ s2 with s1∈[0,10], s2∈[0,20] → **180–210s**. So ~180s is the zero-jitter 
floor and ~210s the max-jitter ceiling — the same scenario. I've reworded to 
state the range and the actual mechanism rather than the "pre-attempt check" 
phrasing. Production behaviour is unchanged either way.



##########
tests/unit_tests/reports/notifications/webhook_tests.py:
##########
@@ -269,3 +271,156 @@ class MockResponse:
 
     with pytest.raises(NotificationParamException, match="redirect"):
         webhook_notification.send()
+
+
+def _make_webhook(mock_header_data) -> WebhookNotification:
+    from superset.reports.models import ReportRecipients, ReportRecipientType
+    from superset.reports.notifications.base import NotificationContent
+
+    content = NotificationContent(
+        name="test alert", header_data=mock_header_data, description="Test 
description"
+    )
+    return WebhookNotification(
+        recipient=ReportRecipients(
+            type=ReportRecipientType.WEBHOOK,
+            recipient_config_json='{"target": "https://example.com/webhook"}',
+        ),
+        content=content,
+    )
+
+
+class _MockServerErrorResponse:
+    status_code = 500
+    text = ""
+
+
+def _allow_internal_app() -> type:
+    class MockCurrentApp:
+        config = {
+            "ALERT_REPORTS_WEBHOOK_HTTPS_ONLY": True,
+            "ALERT_REPORTS_WEBHOOK_ALLOW_INTERNAL_HOSTS": True,
+        }
+
+    return MockCurrentApp
+
+
+def test_send_backoff_bounded_by_max_time(monkeypatch, mock_header_data) -> 
None:
+    """
+    A persistently failing (500) target gives up on wall-time (``max_time``),
+    not just ``max_tries``. ``freeze_time(auto_tick_seconds=30)`` advances the
+    clock on every time access (backoff measures elapsed via ``datetime.now``,
+    which freezegun freezes and ticks), so cumulative elapsed crosses 
``max_time=120``
+    before ``max_tries=5`` is exhausted. We assert the discriminating 
*property*
+    — gave up on wall-time strictly before exhausting ``max_tries`` — rather 
than
+    a pinned count, because freezegun's per-call tick count is an opaque
+    implementation detail. If ``max_time`` is removed this rises to the full 5
+    (RED); the flat-clock companion test below anchors that 5 is the ceiling.
+    The terminal exception type is unchanged on giveup.
+    """
+    webhook_notification = _make_webhook(mock_header_data)
+    post_calls: list[int] = []
+
+    def fake_post(*args, **kwargs) -> _MockServerErrorResponse:
+        post_calls.append(1)
+        return _MockServerErrorResponse()
+
+    monkeypatch.setattr(
+        "superset.reports.notifications.webhook.current_app", 
_allow_internal_app()
+    )
+    monkeypatch.setattr(
+        
"superset.reports.notifications.webhook.feature_flag_manager.is_feature_enabled",
+        lambda flag: True,
+    )
+    monkeypatch.setattr(
+        "superset.reports.notifications.webhook.requests.post", fake_post
+    )
+    monkeypatch.setattr("backoff._sync.time.sleep", lambda *a, **k: None)

Review Comment:
   Done — switched all three sites to `monkeypatch.setattr("time.sleep", …)`, 
which decouples the tests from backoff's internal module layout as you 
suggested.
   
   (Used the string-target form: backoff resolves `time.sleep` at call time, so 
it picks up the stdlib patch; the bare `monkeypatch.setattr(time.sleep, …)` 
form needs the `(target, name, value)` argument shape.)



##########
superset/commands/report/execute.py:
##########
@@ -1176,6 +1207,18 @@ def run(self) -> None:
             if not self._model:
                 raise ReportScheduleExecuteUnexpectedError()
 
+            # Resolve the executor at the run() boundary the same way master
+            # does: tolerate a missing user (find_user -> None) so the state
+            # machine still runs and its error envelope writes the ERROR
+            # execution-log row and sends the owner notification. The dedicated
+            # ReportScheduleExecutorNotFoundError guard lives at the content
+            # sites (_get_screenshots / _get_csv_data / _get_embedded_data),
+            # which raise inside that envelope. Guarding here instead would
+            # surface the executor error above the state machine, suppressing
+            # both the log row and the owner notification. The alert-query path
+            # (AlertCommand) is intentionally left on master behavior — a
+            # missing executor there surfaces as a query error, not the
+            # dedicated executor error; tightening it is out of scope here.

Review Comment:
   Reworded to drop the branch-name references — the comment now describes the 
rationale without tying it to a branch.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] fix(reports): reliability fixes for alert/report execution [superset]

Reply via email to