codeant-ai-for-open-source[bot] commented on code in PR #38576:
URL: https://github.com/apache/superset/pull/38576#discussion_r3482298974
##########
superset/utils/screenshots.py:
##########
@@ -280,43 +281,66 @@ def compute_and_cache( # pylint:
disable=too-many-arguments
:return: Image payload
"""
cache_key = cache_key or self.get_cache_key(window_size, thumb_size)
- cache_payload = self.get_from_cache_key(cache_key) or
ScreenshotCachePayload()
- if not cache_payload.should_trigger_task(force=force):
- logger.info(
- "Skipping compute - already processed for thumbnail: %s",
cache_key
- )
- return
-
- window_size = window_size or self.window_size
- thumb_size = thumb_size or self.thumb_size
- logger.info("Processing url for thumbnail: %s", cache_key)
- cache_payload.computing()
- image = None
- # Assuming all sorts of things can go wrong with Selenium
try:
- logger.info("trying to generate screenshot")
- with
event_logger.log_context(f"screenshot.compute.{self.thumbnail_type}"):
- image = self.get_screenshot(user=user, window_size=window_size)
- except Exception as ex: # pylint: disable=broad-except
- logger.warning("Failed at generating thumbnail %s", ex,
exc_info=True)
- cache_payload.error()
- if image and window_size != thumb_size:
- try:
- image = self.resize_image(image, thumb_size=thumb_size)
- except Exception as ex: # pylint: disable=broad-except
- logger.warning("Failed at resizing thumbnail %s", ex,
exc_info=True)
- cache_payload.error()
+ with DistributedLock(
+ namespace="thumbnail",
+ key=cache_key,
+ ttl_seconds=app.config["THUMBNAIL_COMPUTING_CACHE_TTL"],
+ ):
+ cache_payload = (
+ self.get_from_cache_key(cache_key) or
ScreenshotCachePayload()
+ )
+ if not cache_payload.should_trigger_task(force=force):
+ logger.info(
+ "Skipping compute - already processed for thumbnail:
%s",
+ cache_key,
+ )
+ return
+
+ window_size = window_size or self.window_size
+ thumb_size = thumb_size or self.thumb_size
+ logger.info("Processing url for thumbnail: %s", cache_key)
+ cache_payload.computing()
+ self.cache.set(cache_key, cache_payload.to_dict())
image = None
-
- # Cache the result (success or error) to avoid immediate retries
- if image:
- with
event_logger.log_context(f"screenshot.cache.{self.thumbnail_type}"):
- cache_payload.update(image)
-
- logger.info("Caching thumbnail: %s", cache_key)
- self.cache.set(cache_key, cache_payload.to_dict())
- logger.info("Updated thumbnail cache; Status: %s",
cache_payload.get_status())
- return
+ # Assuming all sorts of things can go wrong with Selenium
+ try:
+ logger.info("trying to generate screenshot")
+ with event_logger.log_context(
+ f"screenshot.compute.{self.thumbnail_type}"
+ ):
+ image = self.get_screenshot(user=user,
window_size=window_size)
+ except Exception as ex: # pylint: disable=broad-except
+ logger.warning(
+ "Failed at generating thumbnail %s", ex, exc_info=True
+ )
+ cache_payload.error()
+ if image and window_size != thumb_size:
+ try:
+ image = self.resize_image(image, thumb_size=thumb_size)
+ except Exception as ex: # pylint: disable=broad-except
+ logger.warning(
+ "Failed at resizing thumbnail %s", ex,
exc_info=True
+ )
+ cache_payload.error()
+ image = None
+
+ if image:
+ with event_logger.log_context(
+ f"screenshot.cache.{self.thumbnail_type}"
+ ):
+ cache_payload.update(image)
+
+ logger.info("Caching thumbnail: %s", cache_key)
+ self.cache.set(cache_key, cache_payload.to_dict())
+ logger.info(
+ "Updated thumbnail cache; Status: %s",
cache_payload.get_status()
+ )
+ except AcquireDistributedLockFailedException:
+ logger.info(
+ "Skipping duplicate thumbnail task for %s - lock already held",
+ cache_key,
+ )
Review Comment:
**Suggestion:** This handler swallows every
`AcquireDistributedLockFailedException`, but lock acquisition raises that same
exception both for normal lock contention and for Redis/backend failures. As
written, infrastructure failures are silently treated as "duplicate task" and
the thumbnail is never generated or retried, causing persistent 202 responses
without recovery. Only suppress true lock-contention cases and re-raise
backend/acquire errors so Celery can retry and operators can see failures.
[incorrect condition logic]
<details>
<summary><b>Severity Level:</b> Major ⚠️</summary>
```mdx
- ❌ Dashboard thumbnail endpoint can return 202 indefinitely.
- ⚠️ Celery thumbnail tasks hide Redis/DB lock failures.
- ⚠️ Operators lose visibility into lock backend outages.
```
</details>
<details>
<summary><b>Steps of Reproduction ✅ </b></summary>
```mdx
1. Call the dashboard thumbnail endpoint `GET
/api/v1/dashboard/<pk>/thumbnail/<digest>/`
implemented in `superset/dashboards/api.py:1584-1699`. The handler constructs
`DashboardScreenshot` and a `cache_key`, then loads `ScreenshotCachePayload`
from cache or
creates a new one and checks `should_trigger_task()` at
`superset/dashboards/api.py:75-85`.
2. When `cache_payload.should_trigger_task()` returns True (status PENDING
or stale), the
API enqueues the Celery task `cache_dashboard_thumbnail` via `.delay(...)` at
`superset/dashboards/api.py:86-95`, which executes
`superset/tasks/thumbnails.py:77-114`.
Inside this task, `DashboardScreenshot.compute_and_cache(...)` is invoked at
`superset/tasks/thumbnails.py:104-114`.
3. `compute_and_cache()` in `superset/utils/screenshots.py:26-105` wraps its
body in
`DistributedLock(namespace="thumbnail", key=cache_key,
ttl_seconds=THUMBNAIL_COMPUTING_CACHE_TTL)` at lines 45-50. The
`DistributedLock` context
manager calls `AcquireDistributedLock.run()` which in turn either
`_acquire_redis` or
`_acquire_kv` at `superset/distributed_lock/__init__.py:52-60` and
`superset/commands/distributed_lock/acquire.py:29-35`.
4. If the lock backend fails (e.g., Redis connection error in
`_acquire_redis`, caught and
re-raised as `AcquireDistributedLockFailedException` at
`superset/commands/distributed_lock/acquire.py:57-60`, or a DB/KeyValue
error re-raised as
`AcquireDistributedLockFailedException` via the `@transaction(on_error=...,
reraise=AcquireDistributedLockFailedException)` decorator at
`superset/commands/distributed_lock/acquire.py:63-72`),
`compute_and_cache()` catches this
in the `except AcquireDistributedLockFailedException:` block at
`superset/utils/screenshots.py:100-104` and only logs "Skipping duplicate
thumbnail task
for %s - lock already held" before returning. The Celery task completes
successfully with
no cache update, leaving the status as PENDING; subsequent calls to the
thumbnail API
continue to enqueue tasks that immediately hit the same backend failure and
are silently
treated as duplicates, so clients persistently receive 202 responses
indicating async
computation without any thumbnail ever being generated or retried.
```
</details>
[](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt_id=cf1d807b33cd451d8fe4585185ff8fad&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
[](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt_id=cf1d807b33cd451d8fe4585185ff8fad&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
*(Use Cmd/Ctrl + Click for best experience)*
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/utils/screenshots.py
**Line:** 339:343
**Comment:**
*Incorrect Condition Logic: This handler swallows every
`AcquireDistributedLockFailedException`, but lock acquisition raises that same
exception both for normal lock contention and for Redis/backend failures. As
written, infrastructure failures are silently treated as "duplicate task" and
the thumbnail is never generated or retried, causing persistent 202 responses
without recovery. Only suppress true lock-contention cases and re-raise
backend/acquire errors so Celery can retry and operators can see failures.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask
user if the user wants to fix the rest of the comments as well. if said yes,
then fetch all the comments validate the correctness and implement a minimal fix
```
</details>
<a
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F38576&comment_hash=653f93898c1d57423a76e3a90231827bfd3704a50a87652bae178ae02c5cd1d5&reaction=like'>👍</a>
| <a
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F38576&comment_hash=653f93898c1d57423a76e3a90231827bfd3704a50a87652bae178ae02c5cd1d5&reaction=dislike'>👎</a>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]