codeant-ai-for-open-source[bot] commented on code in PR #37731:
URL: https://github.com/apache/superset/pull/37731#discussion_r2772674342
##########
superset/utils/pandas_postprocessing/histogram.py:
##########
@@ -48,6 +48,9 @@ def histogram(
if groupby is None:
groupby = []
+ # Create an explicit copy to avoid SettingWithCopyWarning
+ df = df.copy()
Review Comment:
**Suggestion:** Creating a full deep copy of the entire DataFrame can be
very expensive in both time and memory for large inputs, potentially causing
avoidable memory pressure or even MemoryError under high load. A shallow copy
is sufficient to break the chained-assignment relationship and prevent
SettingWithCopyWarning, while avoiding duplication of all the underlying
data. [possible bug]
<details>
<summary><b>Severity Level:</b> Major ⚠️</summary>
```mdx
- ❌ Histogram postprocessing may OOM for large query results.
- ⚠️ Backend worker memory pressure during histogram calculation.
- ⚠️ Dashboard histogram tiles risk failing under large datasets.
```
</details>
```suggestion
# Create a shallow copy to avoid SettingWithCopyWarning without
# duplicating all data
df = df.copy(deep=False)
```
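A minimal sketch (outside Superset; the column name and frame size are illustrative) of why a shallow copy is enough to break the chained-assignment relationship without duplicating the data buffers:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for a query result; column name is illustrative.
df = pd.DataFrame({"value": np.random.rand(1_000_000)})

deep = df.copy()               # default deep=True: duplicates the data buffers
shallow = df.copy(deep=False)  # new DataFrame object, shared data buffers

# The shallow copy is a distinct object, which is what breaks the
# chained-assignment relationship behind SettingWithCopyWarning...
assert shallow is not df

# ...but its column data still points at the same memory as df's, so none
# of the underlying data was duplicated.
assert np.shares_memory(shallow["value"].to_numpy(), df["value"].to_numpy())
assert not np.shares_memory(deep["value"].to_numpy(), df["value"].to_numpy())
```

One caveat worth verifying for this call site: before pandas' copy-on-write mode is enabled, in-place writes through a shallow copy can mutate the original frame's data, since the buffers are shared.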
<details>
<summary><b>Steps of Reproduction ✅ </b></summary>
```mdx
1. In a Superset environment or Python REPL, import the function:
   from superset.utils.pandas_postprocessing.histogram import histogram
   (the implementation lives at superset/utils/pandas_postprocessing/histogram.py
   and the copy call is at lines 51-52).
2. Construct a large pandas DataFrame in the same process, e.g.:
   df = pandas.DataFrame({"value": numpy.random.rand(10_000_000)})
   created in memory prior to calling histogram.
3. Call histogram on that DataFrame:
   histogram(df, column="value", groupby=None)
   Execution enters superset/utils/pandas_postprocessing/histogram.py and hits
   the df.copy() call at lines 51-52, allocating a full duplicate of the
   underlying data.
4. Observe the effect via process monitoring (top/psutil): memory usage spikes,
   roughly doubling the DataFrame's footprint, possibly leading to MemoryError
   or worker OOM and failing the histogram postprocessing step.
```
</details>
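The reproduction steps above can be condensed into a runnable, scaled-down sketch (1M rows instead of 10M; sizes are illustrative, not from the original report), measuring the extra allocation a deep copy makes that a shallow copy avoids:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": np.random.rand(1_000_000)})  # ~8 MB of float64 data
base_bytes = int(df.memory_usage(deep=True).sum())

deep = df.copy()               # duplicates the ~8 MB buffer
shallow = df.copy(deep=False)  # new frame object, same buffer

print(f"original frame: {base_bytes / 1e6:.1f} MB")
# memory_usage reports the same size for both copies, but only the deep
# copy actually owns a freshly allocated buffer of that size.
assert np.shares_memory(shallow["value"].to_numpy(), df["value"].to_numpy())
assert not np.shares_memory(deep["value"].to_numpy(), df["value"].to_numpy())
```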
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/utils/pandas_postprocessing/histogram.py
**Line:** 51:52
**Comment:**
*Possible Bug: Creating a full deep copy of the entire DataFrame can be
very expensive in both time and memory for large inputs, potentially leading to
avoidable memory pressure or even MemoryError in high-load situations; using a
shallow copy is sufficient to break the chained-assignment relationship and
prevent SettingWithCopyWarning while avoiding duplicating all the underlying
data.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
For additional commands, e-mail: [email protected]