codeant-ai-for-open-source[bot] commented on code in PR #40663: URL: https://github.com/apache/superset/pull/40663#discussion_r3338225505
########## superset/key_value/commands/prune.py: ########## @@ -0,0 +1,127 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +from __future__ import annotations + +import logging +import time +from datetime import datetime + +import sqlalchemy as sa + +from superset import db +from superset.commands.base import BaseCommand +from superset.key_value.models import KeyValueEntry + +logger = logging.getLogger(__name__) + + +# pylint: disable=consider-using-transaction +class KeyValuePruneCommand(BaseCommand): + """ + Command to prune the key-value store by deleting entries whose expiry has + already passed. + + The metastore-backed key-value store keeps an `expires_on` timestamp for + entries written with a timeout (for example, the metastore cache backend). + Unlike cache backends that evict on read, the metastore does not remove rows + on its own, so expired entries remain in the table until something deletes + them. This command performs that housekeeping by deleting every entry whose + `expires_on` is in the past, freeing up space in the table. + + Attributes: + max_rows_per_run (int | None): The maximum number of rows to delete in a + single run. If provided and greater than + zero, rows are selected deterministically + from the oldest first by id up to this + limit in this execution. + """ + + def __init__(self, max_rows_per_run: int | None = None): + """ + :param max_rows_per_run: The maximum number of rows to delete in a single run. + If provided and greater than zero, rows are selected deterministically from + the oldest first by id up to this limit in this execution. + """ + self.max_rows_per_run = max_rows_per_run + + def run(self) -> None: + """ + Executes the prune command + """ + batch_size = 999 # SQLite has a IN clause limit of 999 + total_deleted = 0 + start_time = time.time() + + # Select all IDs whose expiry has already passed. Entries without an + # expiry (expires_on IS NULL) never expire and are left untouched. + select_stmt = sa.select(KeyValueEntry.id).where( + KeyValueEntry.expires_on.isnot(None), + KeyValueEntry.expires_on < datetime.now(), + ) Review Comment: **Suggestion:** The expiry filter uses a strict `<` comparison, but the rest of the key-value expiry logic treats equality as expired (`<=`). Entries whose expiry timestamp is exactly `now` will be skipped and remain until a later run, creating inconsistent expiration behavior. Use `<=` to match existing expiry semantics. [incorrect condition logic] <details> <summary><b>Severity Level:</b> Major ⚠️</summary> ```mdx - ⚠️ Boundary-expired key-value rows can remain unpruned. - ⚠️ Expiry semantics differ between DAO and prune command. ``` </details> <details> <summary><b>Steps of Reproduction ✅ </b></summary> ```mdx 1. Note that the core expiry helper `KeyValueEntry.is_expired()` in `superset/key_value/models.py:16-17` treats equality as expired via `self.expires_on <= datetime.now()`, and `KeyValueDAO.delete_expired_entries()` in `superset/daos/key_value.py:31-41` also uses `KeyValueEntry.expires_on <= datetime.now()` when pruning by resource. 2. Insert a key-value entry with an `expires_on` timestamp equal to a specific boundary time using `KeyValueDAO.create_entry()` in `superset/daos/key_value.py:10-28` (for example, via tests or direct DAO usage) so that at that moment `is_expired()` returns `True`. 3. Immediately run `KeyValuePruneCommand().run()` either directly from tests (as in `tests/unit_tests/key_value/prune_test.py:5-21`) or via the `prune_key_value` Celery task at `superset/tasks/scheduler.py:21-31`; inside `run()`, the selection at `superset/key_value/commands/prune.py:71-74` uses `KeyValueEntry.expires_on < datetime.now()`, which excludes entries whose `expires_on` is exactly equal to `datetime.now()` at selection time. 4. Observe that such a boundary-expired row remains in the `key_value` table despite being considered expired by `KeyValueEntry.is_expired()` and `KeyValueDAO.delete_expired_entries()`, showing that the prune command's `<` condition does not align with the rest of the expiry semantics and can leave some expired rows unpruned until a later run. ``` </details> [Fix in Cursor](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt_id=deb816109e3743828915fcf4e2749854&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset) | [Fix in VSCode Claude](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt_id=deb816109e3743828915fcf4e2749854&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset) *(Use Cmd/Ctrl + Click for best experience)* <details> <summary><b>Prompt for AI Agent 🤖 </b></summary> ```mdx This is a comment left during a code review. **Path:** superset/key_value/commands/prune.py **Line:** 71:74 **Comment:** *Incorrect Condition Logic: The expiry filter uses a strict `<` comparison, but the rest of the key-value expiry logic treats equality as expired (`<=`). Entries whose expiry timestamp is exactly `now` will be skipped and remain until a later run, creating inconsistent expiration behavior. Use `<=` to match existing expiry semantics. Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise. Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix ``` </details> <a href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40663&comment_hash=2a785a5f9995ba4b883b38e09d8ab4a1a0e5a357ec92d0aac3af39363d5e173b&reaction=like'>👍</a> | <a href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40663&comment_hash=2a785a5f9995ba4b883b38e09d8ab4a1a0e5a357ec92d0aac3af39363d5e173b&reaction=dislike'>👎</a> ########## superset/key_value/commands/prune.py: ########## @@ -0,0 +1,127 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +from __future__ import annotations + +import logging +import time +from datetime import datetime + +import sqlalchemy as sa + +from superset import db +from superset.commands.base import BaseCommand +from superset.key_value.models import KeyValueEntry + +logger = logging.getLogger(__name__) + + +# pylint: disable=consider-using-transaction +class KeyValuePruneCommand(BaseCommand): + """ + Command to prune the key-value store by deleting entries whose expiry has + already passed. + + The metastore-backed key-value store keeps an `expires_on` timestamp for + entries written with a timeout (for example, the metastore cache backend). + Unlike cache backends that evict on read, the metastore does not remove rows + on its own, so expired entries remain in the table until something deletes + them. This command performs that housekeeping by deleting every entry whose + `expires_on` is in the past, freeing up space in the table. + + Attributes: + max_rows_per_run (int | None): The maximum number of rows to delete in a + single run. If provided and greater than + zero, rows are selected deterministically + from the oldest first by id up to this + limit in this execution. + """ + + def __init__(self, max_rows_per_run: int | None = None): + """ + :param max_rows_per_run: The maximum number of rows to delete in a single run. + If provided and greater than zero, rows are selected deterministically from + the oldest first by id up to this limit in this execution. + """ + self.max_rows_per_run = max_rows_per_run + + def run(self) -> None: + """ + Executes the prune command + """ + batch_size = 999 # SQLite has a IN clause limit of 999 + total_deleted = 0 + start_time = time.time() + + # Select all IDs whose expiry has already passed. Entries without an + # expiry (expires_on IS NULL) never expire and are left untouched. + select_stmt = sa.select(KeyValueEntry.id).where( + KeyValueEntry.expires_on.isnot(None), + KeyValueEntry.expires_on < datetime.now(), + ) + + # Optionally limited by max_rows_per_run + # order by oldest first for deterministic deletion + if self.max_rows_per_run is not None and self.max_rows_per_run > 0: + select_stmt = select_stmt.order_by(KeyValueEntry.id.asc()).limit( + self.max_rows_per_run + ) + + ids_to_delete = db.session.execute(select_stmt).scalars().all() + + total_rows = len(ids_to_delete) Review Comment: **Suggestion:** This loads every expired ID into memory in one shot when no cap is set, which can become very large in the exact scenario this task targets (tables with many accumulated expired rows). For large datasets this can cause high memory usage or worker instability; stream/batch ID selection from the database instead of materializing all IDs at once. [performance] <details> <summary><b>Severity Level:</b> Major ⚠️</summary> ```mdx - ⚠️ prune_key_value Celery worker may experience high memory use. - ⚠️ Large key_value tables make pruning jobs less reliable. ``` </details> <details> <summary><b>Steps of Reproduction ✅ </b></summary> ```mdx 1. Run Superset in a configuration where `FILTER_STATE_CACHE_CONFIG` and `EXPLORE_FORM_DATA_CACHE_CONFIG` in `superset/config.py:1124-1162` use `SupersetMetastoreCache`, causing filter state and explore form data to be stored in the `key_value` table via `SupersetMetastoreCache.set()` (`superset/extensions/metastore_cache.py:11-21`) and `KeyValueDAO.upsert_entry()` (`superset/daos/key_value.py:41-55`). 2. Allow the system to accumulate a very large number of expired `KeyValueEntry` rows over time in the `key_value` table (the exact scenario this housekeeping command targets), for example by heavy use of filter state and explore form data with finite TTLs. 3. Trigger the `prune_key_value` Celery task defined at `superset/tasks/scheduler.py:21-31` without setting `max_rows_per_run` (or by calling `KeyValuePruneCommand()` directly with `max_rows_per_run=None`), so the `select_stmt` in `KeyValuePruneCommand.run()` (`superset/key_value/commands/prune.py:69-81`) is not limited. 4. When `KeyValuePruneCommand.run()` executes, it evaluates `ids_to_delete = db.session.execute(select_stmt).scalars().all()` and `total_rows = len(ids_to_delete)` at `superset/key_value/commands/prune.py:83-85`, materializing all expired row IDs into a Python list; with hundreds of thousands or millions of expired rows this list can consume substantial memory in the Celery worker process, leading to high memory usage or instability during pruning. ``` </details> [Fix in Cursor](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt_id=de0f999e1c6c4907b092a74b6c4e493c&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset) | [Fix in VSCode Claude](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt_id=de0f999e1c6c4907b092a74b6c4e493c&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset) *(Use Cmd/Ctrl + Click for best experience)* <details> <summary><b>Prompt for AI Agent 🤖 </b></summary> ```mdx This is a comment left during a code review. **Path:** superset/key_value/commands/prune.py **Line:** 83:85 **Comment:** *Performance: This loads every expired ID into memory in one shot when no cap is set, which can become very large in the exact scenario this task targets (tables with many accumulated expired rows). For large datasets this can cause high memory usage or worker instability; stream/batch ID selection from the database instead of materializing all IDs at once. Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise. Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix ``` </details> <a href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40663&comment_hash=0be5830fda42fdb902acd1d7cb53f885e5567054c9bb164f9e2bf0b3ad676679&reaction=like'>👍</a> | <a href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40663&comment_hash=0be5830fda42fdb902acd1d7cb53f885e5567054c9bb164f9e2bf0b3ad676679&reaction=dislike'>👎</a> ########## superset/key_value/commands/prune.py: ########## @@ -0,0 +1,127 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +from __future__ import annotations + +import logging +import time +from datetime import datetime + +import sqlalchemy as sa + +from superset import db +from superset.commands.base import BaseCommand +from superset.key_value.models import KeyValueEntry + +logger = logging.getLogger(__name__) + + +# pylint: disable=consider-using-transaction +class KeyValuePruneCommand(BaseCommand): + """ + Command to prune the key-value store by deleting entries whose expiry has + already passed. + + The metastore-backed key-value store keeps an `expires_on` timestamp for + entries written with a timeout (for example, the metastore cache backend). + Unlike cache backends that evict on read, the metastore does not remove rows + on its own, so expired entries remain in the table until something deletes + them. This command performs that housekeeping by deleting every entry whose + `expires_on` is in the past, freeing up space in the table. + + Attributes: + max_rows_per_run (int | None): The maximum number of rows to delete in a + single run. If provided and greater than + zero, rows are selected deterministically + from the oldest first by id up to this + limit in this execution. + """ + + def __init__(self, max_rows_per_run: int | None = None): + """ + :param max_rows_per_run: The maximum number of rows to delete in a single run. + If provided and greater than zero, rows are selected deterministically from + the oldest first by id up to this limit in this execution. + """ + self.max_rows_per_run = max_rows_per_run + + def run(self) -> None: + """ + Executes the prune command + """ + batch_size = 999 # SQLite has a IN clause limit of 999 + total_deleted = 0 + start_time = time.time() + + # Select all IDs whose expiry has already passed. Entries without an + # expiry (expires_on IS NULL) never expire and are left untouched. + select_stmt = sa.select(KeyValueEntry.id).where( + KeyValueEntry.expires_on.isnot(None), + KeyValueEntry.expires_on < datetime.now(), + ) + + # Optionally limited by max_rows_per_run + # order by oldest first for deterministic deletion + if self.max_rows_per_run is not None and self.max_rows_per_run > 0: + select_stmt = select_stmt.order_by(KeyValueEntry.id.asc()).limit( + self.max_rows_per_run + ) + + ids_to_delete = db.session.execute(select_stmt).scalars().all() + + total_rows = len(ids_to_delete) + + logger.info("Total rows to be deleted: %s", f"{total_rows:,}") + + next_logging_threshold = 1 + + # Iterate over the IDs in batches + for i in range(0, total_rows, batch_size): + batch_ids = ids_to_delete[i : i + batch_size] + + # Delete the selected batch using IN clause + result = db.session.execute( + sa.delete(KeyValueEntry).where(KeyValueEntry.id.in_(batch_ids)) + ) Review Comment: **Suggestion:** This delete step is vulnerable to a race: IDs are selected first, then deleted later without re-checking expiry. If another request refreshes an entry (updates `expires_on` to the future) between those two operations, this code will still delete the now-valid row. Add an expiry predicate to the delete statement (using the same cutoff timestamp captured before selection) so only still-expired rows are removed. [race condition] <details> <summary><b>Severity Level:</b> Critical 🚨</summary> ```mdx - ❌ Valid metastore cache entries can be deleted prematurely. - ❌ Filter state cache values can disappear unexpectedly. - ⚠️ Explore form data cache may be pruned while active. ``` </details> <details> <summary><b>Steps of Reproduction ✅ </b></summary> ```mdx 1. Configure Superset with the default metastore-backed caches for filter state and explore form data, as defined in `superset/config.py:1124-1143` and `superset/config.py:1143-1162`, which use `SupersetMetastoreCache` as the cache backend. 2. Store a cached value via `SupersetMetastoreCache.set()` in `superset/extensions/metastore_cache.py:11-21`, which calls `KeyValueDAO.upsert_entry()` in `superset/daos/key_value.py:41-55` and sets `KeyValueEntry.expires_on` to a future time. 3. Allow the cached entry to expire so that `KeyValueEntry.is_expired()` in `superset/key_value/models.py:16-17` returns `True`, then start the `prune_key_value` Celery task defined in `superset/tasks/scheduler.py:21-35`, which calls `KeyValuePruneCommand(max_rows_per_run).run()` at `superset/tasks/scheduler.py:30-31`. 4. While `KeyValuePruneCommand.run()` is executing, after it selects expired IDs at `superset/key_value/commands/prune.py:71-74` and before deleting them, issue another `SupersetMetastoreCache.set()` for the same key, which updates `entry.expires_on` to a future value via `KeyValueDAO.upsert_entry()` (`superset/daos/key_value.py:47-53`); the prune command then executes the delete at `superset/key_value/commands/prune.py:96-98` using only `KeyValueEntry.id.in_(batch_ids)`, deleting the now-refreshed (non-expired) row because the delete does not re-check `expires_on`. ``` </details> [Fix in Cursor](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt_id=ab1e302414434948b1ee9f00235af900&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset) | [Fix in VSCode Claude](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt_id=ab1e302414434948b1ee9f00235af900&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset) *(Use Cmd/Ctrl + Click for best experience)* <details> <summary><b>Prompt for AI Agent 🤖 </b></summary> ```mdx This is a comment left during a code review. **Path:** superset/key_value/commands/prune.py **Line:** 96:98 **Comment:** *Race Condition: This delete step is vulnerable to a race: IDs are selected first, then deleted later without re-checking expiry. If another request refreshes an entry (updates `expires_on` to the future) between those two operations, this code will still delete the now-valid row. Add an expiry predicate to the delete statement (using the same cutoff timestamp captured before selection) so only still-expired rows are removed. Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise. Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix ``` </details> <a href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40663&comment_hash=59a84b82852c9ad005b94f6b2b617f309e20b12ac49f0a080a240ee2d30a6f9c&reaction=like'>👍</a> | <a href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40663&comment_hash=59a84b82852c9ad005b94f6b2b617f309e20b12ac49f0a080a240ee2d30a6f9c&reaction=dislike'>👎</a> -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
