Re: [PR] fix(reports): add network timeouts so schedules can't hang forever [superset]

via GitHub Tue, 23 Jun 2026 06:04:49 -0700


codeant-ai-for-open-source[bot] commented on code in PR #41250:
URL: https://github.com/apache/superset/pull/41250#discussion_r3459935671



##########
superset/utils/core.py:
##########
@@ -948,9 +953,11 @@ def send_mime_email(
     # root CA certificates
     ssl_context = ssl.create_default_context() if smtp_ssl_server_auth else 
None
     smtp = (
-        smtplib.SMTP_SSL(smtp_host, smtp_port, context=ssl_context)
+        smtplib.SMTP_SSL(
+            smtp_host, smtp_port, context=ssl_context, timeout=smtp_timeout
+        )
         if smtp_ssl
-        else smtplib.SMTP(smtp_host, smtp_port)
+        else smtplib.SMTP(smtp_host, smtp_port, timeout=smtp_timeout)
     )

Review Comment:
   **Suggestion:** The SMTP connection is not protected by a `try/finally` (or 
context manager), so when the new timeout causes `starttls`, `login`, or 
`sendmail` to raise, `quit()` is skipped and sockets can accumulate in the 
worker process. Wrap the SMTP session lifecycle so cleanup always runs on error 
paths. [missing cleanup]
   
   <details>
   <summary><b>Severity Level:</b> Major ⚠️</summary>
   
   ```mdx
   - ❌ Repeated SMTP timeouts leak sockets in Celery worker process.
   - ⚠️ Alert and report email failures increase worker resource usage.
   - ⚠️ Dashboard deactivation emails may leak SMTP connections.
   ```
   </details>
   <details>
   <summary><b>Steps of Reproduction ✅ </b></summary>
   
   ```mdx
   1. Configure an EMAIL report recipient so that the scheduler uses the report 
execution
   flow in `superset/commands/report/execute.py:_send` (lines 740–799), which 
iterates
   `recipients` and calls `notification.send()` for each via 
`create_notification(...)` from
   `superset/reports/notifications/__init__.py:25-31`.
   
   2. Ensure the recipient type is EMAIL so `create_notification` instantiates
   `EmailNotification` from `superset/reports/notifications/email.py:80-86`; 
its `send()`
   method at lines 33-56 calls `send_email_smtp(...)` from 
`superset/utils/core.py` with
   `current_app.config` and `dryrun=False`.
   
   3. Inside `superset/utils/core.py:send_email_smtp` (lines 829-879), the 
function builds a
   `MIMEMultipart` and calls `send_mime_email(smtp_mail_from, recipients, msg, 
config,
   dryrun=dryrun)` at line 5 of the earlier snippet, which enters 
`send_mime_email` defined
   at `superset/utils/core.py:927-49;` there the SMTP client is created at 
lines 955-961
   using `smtplib.SMTP_SSL(..., timeout=smtp_timeout)` or `smtplib.SMTP(...,
   timeout=smtp_timeout)`, and only after `smtp.starttls()`, `smtp.login()`, and
   `smtp.sendmail()` does it call `smtp.quit()` at line 49.
   
   4. Point `SMTP_HOST`/`SMTP_PORT` in `current_app.config` to an SMTP endpoint 
that accepts
   TCP but hangs longer than `SMTP_TIMEOUT` during `starttls`, `login`, or 
`sendmail`,
   causing those calls in `send_mime_email` (lines 43-48) to raise (e.g. 
`socket.timeout`);
   because there is no `try/finally` or `with smtplib.SMTP(...)` context 
manager anywhere in
   the repo (confirmed by a `Grep` search for `with smtplib.SMTP` returning no 
matches), the
   exception unwinds out of `send_mime_email` before reaching `smtp.quit()`, so 
the SMTP
   socket is not explicitly closed and can accumulate across repeated failing 
sends in the
   long-lived worker process.
   ```
   </details>
   
   [![Fix in 
Cursor](https://new-codeant-butcket.s3.us-west-1.amazonaws.com/badges/fix-in-cursor-flat.svg)](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt_id=d355d61a579d412c84559a6b71a86511&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
 [![Fix in VSCode 
Claude](https://new-codeant-butcket.s3.us-west-1.amazonaws.com/badges/fix-in-vscode-claude-flat.svg)](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt_id=d355d61a579d412c84559a6b71a86511&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
   
   *(Use Cmd/Ctrl + Click for best experience)*
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** superset/utils/core.py
   **Line:** 955:961
   **Comment:**
        *Missing Cleanup: The SMTP connection is not protected by a 
`try/finally` (or context manager), so when the new timeout causes `starttls`, 
`login`, or `sendmail` to raise, `quit()` is skipped and sockets can accumulate 
in the worker process. Wrap the SMTP session lifecycle so cleanup always runs 
on error paths.
   
   Validate the correctness of the flagged issue. If correct, How can I resolve 
this? If you propose a fix, implement it and please make it concise.
   Once fix is implemented, also check other comments on the same PR, and ask 
user if the user wants to fix the rest of the comments as well. if said yes, 
then fetch all the comments validate the correctness and implement a minimal fix
   ```
   </details>
   <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F41250&comment_hash=3e1ec04d4520acc344ad3082ef1c5ceb852af8b894ed4b9cd4e4e8bdfef10f8f&reaction=like'>👍</a>
 | <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F41250&comment_hash=3e1ec04d4520acc344ad3082ef1c5ceb852af8b894ed4b9cd4e4e8bdfef10f8f&reaction=dislike'>👎</a>



##########
superset/utils/webdriver.py:
##########
@@ -462,6 +462,11 @@ def driver(self) -> WebDriver:
             if not self._driver:
                 raise RuntimeError("WebDriver creation failed")
             self._driver.set_window_size(*self._window)
+            # Bound driver.get() so an unreachable page raises a 
TimeoutException
+            # instead of blocking the worker (and the report schedule) forever.
+            page_load_wait = app.config["SCREENSHOT_PAGE_LOAD_WAIT"]
+            if page_load_wait is not None:
+                self._driver.set_page_load_timeout(page_load_wait)

Review Comment:
   **Suggestion:** The new page-load timeout can now raise during driver 
initialization, but there is no rollback of the already-assigned driver 
instance. If an exception is raised here (or in the immediate auth step that 
now uses this timeout), `_driver` stays cached in a partially 
initialized/unauthenticated state, and later calls will reuse it instead of 
recreating and re-authenticating. Wrap initialization/auth timeout setup in 
failure cleanup that resets/destroys `_driver` before re-raising. [stale 
reference]
   
   <details>
   <summary><b>Severity Level:</b> Major ⚠️</summary>
   
   ```mdx
   - ⚠️ Cache warmup can reuse unauthenticated Selenium drivers.
   - ⚠️ Page-load timeout may remain unset after initialization failure.
   - ⚠️ Subsequent warmup URLs may behave inconsistently or hang.
   ```
   </details>
   <details>
   <summary><b>Steps of Reproduction ✅ </b></summary>
   
   ```mdx
   1. In `superset/tasks/cache.py:256-268` the cache warmup task constructs a 
long-lived
   `WebDriverSelenium` instance via `wd =
   WebDriverSelenium(current_app.config["WEBDRIVER_TYPE"], user=user)` and 
reuses it inside a
   loop, only calling `wd.destroy()` once in the outer `finally` block after 
all URLs are
   processed.
   
   2. For each URL, `wd.get_screenshot(url, "grid-container")` (defined in
   `superset/utils/webdriver.py:687-745`) calls `driver = self.driver`, which 
executes the
   `driver` property initialization logic at 
`superset/utils/webdriver.py:109-120`, including
   `_create()`, `set_window_size`, the new `set_page_load_timeout`, and 
`_auth(self._user)`.
   
   3. If any of these initialization steps after `_create()` raise (for example,
   `self._driver.set_page_load_timeout(page_load_wait)` at
   `superset/utils/webdriver.py:118-120` or `_auth(self._user)` at
   `superset/utils/webdriver.py:20-26` raising a `WebDriverException` or 
another error), the
   exception propagates back to `wd.get_screenshot`, is caught by the inner 
`except
   Exception` in `cache.py:259-266`, and the loop continues, but `_driver` 
remains set to the
   partially initialized instance because there is no cleanup in the `driver` 
property.
   
   4. On the next iteration of the warmup loop, `wd.get_screenshot()` calls 
`self.driver`
   again; since `self._driver` is now truthy, the property returns the cached 
instance
   without re-running `set_window_size`, `set_page_load_timeout`, or `_auth`, 
so subsequent
   screenshot calls reuse a possibly unauthenticated or misconfigured driver 
until
   `wd.destroy()` eventually runs at the end of the task, confirming the 
suggestion that
   initialization-time failures can leave a stale cached driver.
   ```
   </details>
   
   [![Fix in 
Cursor](https://new-codeant-butcket.s3.us-west-1.amazonaws.com/badges/fix-in-cursor-flat.svg)](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt_id=1ef71d12b033431fa6b52a1d139b0599&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
 [![Fix in VSCode 
Claude](https://new-codeant-butcket.s3.us-west-1.amazonaws.com/badges/fix-in-vscode-claude-flat.svg)](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt_id=1ef71d12b033431fa6b52a1d139b0599&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
   
   *(Use Cmd/Ctrl + Click for best experience)*
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** superset/utils/webdriver.py
   **Line:** 465:469
   **Comment:**
        *Stale Reference: The new page-load timeout can now raise during driver 
initialization, but there is no rollback of the already-assigned driver 
instance. If an exception is raised here (or in the immediate auth step that 
now uses this timeout), `_driver` stays cached in a partially 
initialized/unauthenticated state, and later calls will reuse it instead of 
recreating and re-authenticating. Wrap initialization/auth timeout setup in 
failure cleanup that resets/destroys `_driver` before re-raising.
   
   Validate the correctness of the flagged issue. If correct, How can I resolve 
this? If you propose a fix, implement it and please make it concise.
   Once fix is implemented, also check other comments on the same PR, and ask 
user if the user wants to fix the rest of the comments as well. if said yes, 
then fetch all the comments validate the correctness and implement a minimal fix
   ```
   </details>
   <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F41250&comment_hash=f8f2dedb4058160018271006198995a078de7fc16047676996d2f37051e9be4c&reaction=like'>👍</a>
 | <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F41250&comment_hash=f8f2dedb4058160018271006198995a078de7fc16047676996d2f37051e9be4c&reaction=dislike'>👎</a>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] fix(reports): add network timeouts so schedules can't hang forever [superset]

Reply via email to