codeant-ai-for-open-source[bot] commented on code in PR #40663:
URL: https://github.com/apache/superset/pull/40663#discussion_r3338225505


##########
superset/key_value/commands/prune.py:
##########
@@ -0,0 +1,127 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+from __future__ import annotations
+
+import logging
+import time
+from datetime import datetime
+
+import sqlalchemy as sa
+
+from superset import db
+from superset.commands.base import BaseCommand
+from superset.key_value.models import KeyValueEntry
+
+logger = logging.getLogger(__name__)
+
+
+# pylint: disable=consider-using-transaction
+class KeyValuePruneCommand(BaseCommand):
+    """
+    Command to prune the key-value store by deleting entries whose expiry has
+    already passed.
+
+    The metastore-backed key-value store keeps an `expires_on` timestamp for
+    entries written with a timeout (for example, the metastore cache backend).
+    Unlike cache backends that evict on read, the metastore does not remove 
rows
+    on its own, so expired entries remain in the table until something deletes
+    them. This command performs that housekeeping by deleting every entry whose
+    `expires_on` is in the past, freeing up space in the table.
+
+    Attributes:
+        max_rows_per_run (int | None): The maximum number of rows to delete in 
a
+                                       single run. If provided and greater than
+                                       zero, rows are selected 
deterministically
+                                       from the oldest first by id up to this
+                                       limit in this execution.
+    """
+
+    def __init__(self, max_rows_per_run: int | None = None):
+        """
+        :param max_rows_per_run: The maximum number of rows to delete in a 
single run.
+            If provided and greater than zero, rows are selected 
deterministically from
+            the oldest first by id up to this limit in this execution.
+        """
+        self.max_rows_per_run = max_rows_per_run
+
+    def run(self) -> None:
+        """
+        Executes the prune command
+        """
+        batch_size = 999  # SQLite has a IN clause limit of 999
+        total_deleted = 0
+        start_time = time.time()
+
+        # Select all IDs whose expiry has already passed. Entries without an
+        # expiry (expires_on IS NULL) never expire and are left untouched.
+        select_stmt = sa.select(KeyValueEntry.id).where(
+            KeyValueEntry.expires_on.isnot(None),
+            KeyValueEntry.expires_on < datetime.now(),
+        )

Review Comment:
   **Suggestion:** The expiry filter uses a strict `<` comparison, but the rest 
of the key-value expiry logic treats equality as expired (`<=`). Entries whose 
expiry timestamp is exactly `now` will be skipped and remain until a later run, 
creating inconsistent expiration behavior. Use `<=` to match existing expiry 
semantics. [incorrect condition logic]
   
   <details>
   <summary><b>Severity Level:</b> Major ⚠️</summary>
   
   ```mdx
   - ⚠️ Boundary-expired key-value rows can remain unpruned.
   - ⚠️ Expiry semantics differ between DAO and prune command.
   ```
   </details>
   <details>
   <summary><b>Steps of Reproduction ✅ </b></summary>
   
   ```mdx
   1. Note that the core expiry helper `KeyValueEntry.is_expired()` in
   `superset/key_value/models.py:16-17` treats equality as expired via 
`self.expires_on <=
   datetime.now()`, and `KeyValueDAO.delete_expired_entries()` in
   `superset/daos/key_value.py:31-41` also uses `KeyValueEntry.expires_on <= 
datetime.now()`
   when pruning by resource.
   
   2. Insert a key-value entry with an `expires_on` timestamp equal to a 
specific boundary
   time using `KeyValueDAO.create_entry()` in 
`superset/daos/key_value.py:10-28` (for
   example, via tests or direct DAO usage) so that at that moment 
`is_expired()` returns
   `True`.
   
   3. Immediately run `KeyValuePruneCommand().run()` either directly from tests 
(as in
   `tests/unit_tests/key_value/prune_test.py:5-21`) or via the 
`prune_key_value` Celery task
   at `superset/tasks/scheduler.py:21-31`; inside `run()`, the selection at
   `superset/key_value/commands/prune.py:71-74` uses `KeyValueEntry.expires_on <
   datetime.now()`, which excludes entries whose `expires_on` is exactly equal 
to
   `datetime.now()` at selection time.
   
   4. Observe that such a boundary-expired row remains in the `key_value` table 
despite being
   considered expired by `KeyValueEntry.is_expired()` and
   `KeyValueDAO.delete_expired_entries()`, showing that the prune command's `<` 
condition
   does not align with the rest of the expiry semantics and can leave some 
expired rows
   unpruned until a later run.
   ```
   </details>
   
   [Fix in 
Cursor](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt_id=deb816109e3743828915fcf4e2749854&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
 | [Fix in VSCode 
Claude](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt_id=deb816109e3743828915fcf4e2749854&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
   
   *(Use Cmd/Ctrl + Click for best experience)*
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** superset/key_value/commands/prune.py
   **Line:** 71:74
   **Comment:**
        *Incorrect Condition Logic: The expiry filter uses a strict `<` 
comparison, but the rest of the key-value expiry logic treats equality as 
expired (`<=`). Entries whose expiry timestamp is exactly `now` will be skipped 
and remain until a later run, creating inconsistent expiration behavior. Use 
`<=` to match existing expiry semantics.
   
   Validate the correctness of the flagged issue. If correct, How can I resolve 
this? If you propose a fix, implement it and please make it concise.
   Once fix is implemented, also check other comments on the same PR, and ask 
user if the user wants to fix the rest of the comments as well. if said yes, 
then fetch all the comments validate the correctness and implement a minimal fix
   ```
   </details>
   <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40663&comment_hash=2a785a5f9995ba4b883b38e09d8ab4a1a0e5a357ec92d0aac3af39363d5e173b&reaction=like'>👍</a>
 | <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40663&comment_hash=2a785a5f9995ba4b883b38e09d8ab4a1a0e5a357ec92d0aac3af39363d5e173b&reaction=dislike'>👎</a>



##########
superset/key_value/commands/prune.py:
##########
@@ -0,0 +1,127 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+from __future__ import annotations
+
+import logging
+import time
+from datetime import datetime
+
+import sqlalchemy as sa
+
+from superset import db
+from superset.commands.base import BaseCommand
+from superset.key_value.models import KeyValueEntry
+
+logger = logging.getLogger(__name__)
+
+
+# pylint: disable=consider-using-transaction
+class KeyValuePruneCommand(BaseCommand):
+    """
+    Command to prune the key-value store by deleting entries whose expiry has
+    already passed.
+
+    The metastore-backed key-value store keeps an `expires_on` timestamp for
+    entries written with a timeout (for example, the metastore cache backend).
+    Unlike cache backends that evict on read, the metastore does not remove 
rows
+    on its own, so expired entries remain in the table until something deletes
+    them. This command performs that housekeeping by deleting every entry whose
+    `expires_on` is in the past, freeing up space in the table.
+
+    Attributes:
+        max_rows_per_run (int | None): The maximum number of rows to delete in 
a
+                                       single run. If provided and greater than
+                                       zero, rows are selected 
deterministically
+                                       from the oldest first by id up to this
+                                       limit in this execution.
+    """
+
+    def __init__(self, max_rows_per_run: int | None = None):
+        """
+        :param max_rows_per_run: The maximum number of rows to delete in a 
single run.
+            If provided and greater than zero, rows are selected 
deterministically from
+            the oldest first by id up to this limit in this execution.
+        """
+        self.max_rows_per_run = max_rows_per_run
+
+    def run(self) -> None:
+        """
+        Executes the prune command
+        """
+        batch_size = 999  # SQLite has a IN clause limit of 999
+        total_deleted = 0
+        start_time = time.time()
+
+        # Select all IDs whose expiry has already passed. Entries without an
+        # expiry (expires_on IS NULL) never expire and are left untouched.
+        select_stmt = sa.select(KeyValueEntry.id).where(
+            KeyValueEntry.expires_on.isnot(None),
+            KeyValueEntry.expires_on < datetime.now(),
+        )
+
+        # Optionally limited by max_rows_per_run
+        # order by oldest first for deterministic deletion
+        if self.max_rows_per_run is not None and self.max_rows_per_run > 0:
+            select_stmt = select_stmt.order_by(KeyValueEntry.id.asc()).limit(
+                self.max_rows_per_run
+            )
+
+        ids_to_delete = db.session.execute(select_stmt).scalars().all()
+
+        total_rows = len(ids_to_delete)

Review Comment:
   **Suggestion:** This loads every expired ID into memory in one shot when no 
cap is set, which can become very large in the exact scenario this task targets 
(tables with many accumulated expired rows). For large datasets this can cause 
high memory usage or worker instability; stream/batch ID selection from the 
database instead of materializing all IDs at once. [performance]
   
   <details>
   <summary><b>Severity Level:</b> Major ⚠️</summary>
   
   ```mdx
   - ⚠️ prune_key_value Celery worker may experience high memory use.
   - ⚠️ Large key_value tables make pruning jobs less reliable.
   ```
   </details>
   <details>
   <summary><b>Steps of Reproduction ✅ </b></summary>
   
   ```mdx
   1. Run Superset in a configuration where `FILTER_STATE_CACHE_CONFIG` and
   `EXPLORE_FORM_DATA_CACHE_CONFIG` in `superset/config.py:1124-1162` use
   `SupersetMetastoreCache`, causing filter state and explore form data to be 
stored in the
   `key_value` table via `SupersetMetastoreCache.set()`
   (`superset/extensions/metastore_cache.py:11-21`) and 
`KeyValueDAO.upsert_entry()`
   (`superset/daos/key_value.py:41-55`).
   
   2. Allow the system to accumulate a very large number of expired 
`KeyValueEntry` rows over
   time in the `key_value` table (the exact scenario this housekeeping command 
targets), for
   example by heavy use of filter state and explore form data with finite TTLs.
   
   3. Trigger the `prune_key_value` Celery task defined at
   `superset/tasks/scheduler.py:21-31` without setting `max_rows_per_run` (or 
by calling
   `KeyValuePruneCommand()` directly with `max_rows_per_run=None`), so the 
`select_stmt` in
   `KeyValuePruneCommand.run()` (`superset/key_value/commands/prune.py:69-81`) 
is not
   limited.
   
   4. When `KeyValuePruneCommand.run()` executes, it evaluates `ids_to_delete =
   db.session.execute(select_stmt).scalars().all()` and `total_rows = 
len(ids_to_delete)` at
   `superset/key_value/commands/prune.py:83-85`, materializing all expired row 
IDs into a
   Python list; with hundreds of thousands or millions of expired rows this 
list can consume
   substantial memory in the Celery worker process, leading to high memory 
usage or
   instability during pruning.
   ```
   </details>
   
   [Fix in 
Cursor](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt_id=de0f999e1c6c4907b092a74b6c4e493c&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
 | [Fix in VSCode 
Claude](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt_id=de0f999e1c6c4907b092a74b6c4e493c&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
   
   *(Use Cmd/Ctrl + Click for best experience)*
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** superset/key_value/commands/prune.py
   **Line:** 83:85
   **Comment:**
        *Performance: This loads every expired ID into memory in one shot when 
no cap is set, which can become very large in the exact scenario this task 
targets (tables with many accumulated expired rows). For large datasets this 
can cause high memory usage or worker instability; stream/batch ID selection 
from the database instead of materializing all IDs at once.
   
   Validate the correctness of the flagged issue. If correct, How can I resolve 
this? If you propose a fix, implement it and please make it concise.
   Once fix is implemented, also check other comments on the same PR, and ask 
user if the user wants to fix the rest of the comments as well. if said yes, 
then fetch all the comments validate the correctness and implement a minimal fix
   ```
   </details>
   <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40663&comment_hash=0be5830fda42fdb902acd1d7cb53f885e5567054c9bb164f9e2bf0b3ad676679&reaction=like'>👍</a>
 | <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40663&comment_hash=0be5830fda42fdb902acd1d7cb53f885e5567054c9bb164f9e2bf0b3ad676679&reaction=dislike'>👎</a>



##########
superset/key_value/commands/prune.py:
##########
@@ -0,0 +1,127 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+from __future__ import annotations
+
+import logging
+import time
+from datetime import datetime
+
+import sqlalchemy as sa
+
+from superset import db
+from superset.commands.base import BaseCommand
+from superset.key_value.models import KeyValueEntry
+
+logger = logging.getLogger(__name__)
+
+
+# pylint: disable=consider-using-transaction
+class KeyValuePruneCommand(BaseCommand):
+    """
+    Command to prune the key-value store by deleting entries whose expiry has
+    already passed.
+
+    The metastore-backed key-value store keeps an `expires_on` timestamp for
+    entries written with a timeout (for example, the metastore cache backend).
+    Unlike cache backends that evict on read, the metastore does not remove 
rows
+    on its own, so expired entries remain in the table until something deletes
+    them. This command performs that housekeeping by deleting every entry whose
+    `expires_on` is in the past, freeing up space in the table.
+
+    Attributes:
+        max_rows_per_run (int | None): The maximum number of rows to delete in 
a
+                                       single run. If provided and greater than
+                                       zero, rows are selected 
deterministically
+                                       from the oldest first by id up to this
+                                       limit in this execution.
+    """
+
+    def __init__(self, max_rows_per_run: int | None = None):
+        """
+        :param max_rows_per_run: The maximum number of rows to delete in a 
single run.
+            If provided and greater than zero, rows are selected 
deterministically from
+            the oldest first by id up to this limit in this execution.
+        """
+        self.max_rows_per_run = max_rows_per_run
+
+    def run(self) -> None:
+        """
+        Executes the prune command
+        """
+        batch_size = 999  # SQLite has a IN clause limit of 999
+        total_deleted = 0
+        start_time = time.time()
+
+        # Select all IDs whose expiry has already passed. Entries without an
+        # expiry (expires_on IS NULL) never expire and are left untouched.
+        select_stmt = sa.select(KeyValueEntry.id).where(
+            KeyValueEntry.expires_on.isnot(None),
+            KeyValueEntry.expires_on < datetime.now(),
+        )
+
+        # Optionally limited by max_rows_per_run
+        # order by oldest first for deterministic deletion
+        if self.max_rows_per_run is not None and self.max_rows_per_run > 0:
+            select_stmt = select_stmt.order_by(KeyValueEntry.id.asc()).limit(
+                self.max_rows_per_run
+            )
+
+        ids_to_delete = db.session.execute(select_stmt).scalars().all()
+
+        total_rows = len(ids_to_delete)
+
+        logger.info("Total rows to be deleted: %s", f"{total_rows:,}")
+
+        next_logging_threshold = 1
+
+        # Iterate over the IDs in batches
+        for i in range(0, total_rows, batch_size):
+            batch_ids = ids_to_delete[i : i + batch_size]
+
+            # Delete the selected batch using IN clause
+            result = db.session.execute(
+                sa.delete(KeyValueEntry).where(KeyValueEntry.id.in_(batch_ids))
+            )

Review Comment:
   **Suggestion:** This delete step is vulnerable to a race: IDs are selected 
first, then deleted later without re-checking expiry. If another request 
refreshes an entry (updates `expires_on` to the future) between those two 
operations, this code will still delete the now-valid row. Add an expiry 
predicate to the delete statement (using the same cutoff timestamp captured 
before selection) so only still-expired rows are removed. [race condition]
   
   <details>
   <summary><b>Severity Level:</b> Critical 🚨</summary>
   
   ```mdx
   - ❌ Valid metastore cache entries can be deleted prematurely.
   - ❌ Filter state cache values can disappear unexpectedly.
   - ⚠️ Explore form data cache may be pruned while active.
   ```
   </details>
   <details>
   <summary><b>Steps of Reproduction ✅ </b></summary>
   
   ```mdx
   1. Configure Superset with the default metastore-backed caches for filter 
state and
   explore form data, as defined in `superset/config.py:1124-1143` and
   `superset/config.py:1143-1162`, which use `SupersetMetastoreCache` as the 
cache backend.
   
   2. Store a cached value via `SupersetMetastoreCache.set()` in
   `superset/extensions/metastore_cache.py:11-21`, which calls 
`KeyValueDAO.upsert_entry()`
   in `superset/daos/key_value.py:41-55` and sets `KeyValueEntry.expires_on` to 
a future
   time.
   
   3. Allow the cached entry to expire so that `KeyValueEntry.is_expired()` in
   `superset/key_value/models.py:16-17` returns `True`, then start the 
`prune_key_value`
   Celery task defined in `superset/tasks/scheduler.py:21-35`, which calls
   `KeyValuePruneCommand(max_rows_per_run).run()` at 
`superset/tasks/scheduler.py:30-31`.
   
   4. While `KeyValuePruneCommand.run()` is executing, after it selects expired 
IDs at
   `superset/key_value/commands/prune.py:71-74` and before deleting them, issue 
another
   `SupersetMetastoreCache.set()` for the same key, which updates 
`entry.expires_on` to a
   future value via `KeyValueDAO.upsert_entry()` 
(`superset/daos/key_value.py:47-53`); the
   prune command then executes the delete at 
`superset/key_value/commands/prune.py:96-98`
   using only `KeyValueEntry.id.in_(batch_ids)`, deleting the now-refreshed 
(non-expired) row
   because the delete does not re-check `expires_on`.
   ```
   </details>
   
   [Fix in 
Cursor](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt_id=ab1e302414434948b1ee9f00235af900&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
 | [Fix in VSCode 
Claude](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt_id=ab1e302414434948b1ee9f00235af900&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
   
   *(Use Cmd/Ctrl + Click for best experience)*
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** superset/key_value/commands/prune.py
   **Line:** 96:98
   **Comment:**
        *Race Condition: This delete step is vulnerable to a race: IDs are 
selected first, then deleted later without re-checking expiry. If another 
request refreshes an entry (updates `expires_on` to the future) between those 
two operations, this code will still delete the now-valid row. Add an expiry 
predicate to the delete statement (using the same cutoff timestamp captured 
before selection) so only still-expired rows are removed.
   
   Validate the correctness of the flagged issue. If correct, How can I resolve 
this? If you propose a fix, implement it and please make it concise.
   Once fix is implemented, also check other comments on the same PR, and ask 
user if the user wants to fix the rest of the comments as well. if said yes, 
then fetch all the comments validate the correctness and implement a minimal fix
   ```
   </details>
   <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40663&comment_hash=59a84b82852c9ad005b94f6b2b617f309e20b12ac49f0a080a240ee2d30a6f9c&reaction=like'>👍</a>
 | <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40663&comment_hash=59a84b82852c9ad005b94f6b2b617f309e20b12ac49f0a080a240ee2d30a6f9c&reaction=dislike'>👎</a>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to