codeant-ai-for-open-source[bot] commented on code in PR #41133:
URL: https://github.com/apache/superset/pull/41133#discussion_r3429850925


##########
superset/utils/excel_streaming.py:
##########
@@ -0,0 +1,190 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+"""
+Streaming, constant-memory XLSX writer for multi-sheet dashboard exports.
+
+Unlike :mod:`superset.utils.excel`, which materializes a whole DataFrame in
+memory, this writer streams rows one at a time into an ``xlsxwriter`` workbook
+opened in ``constant_memory`` mode, so a dashboard with many large charts never
+holds more than one row per sheet in memory at once.
+"""
+
+from __future__ import annotations
+
+import math
+import numbers
+import re
+from collections.abc import Iterable, Sequence
+from datetime import date, datetime
+from decimal import Decimal
+from typing import Any
+
+import xlsxwriter
+
+from superset.utils.excel import NEUTRAL_DOCUMENT_PROPERTIES
+
+# Excel limits a sheet name to 31 characters and forbids these characters.
+MAX_SHEET_NAME_LEN = 31
+_INVALID_SHEET_CHARS_RE = re.compile(r"[\[\]:*?/\\]")
+# Excel reserves the sheet name "History" (case-insensitive).
+_RESERVED_SHEET_NAME = "history"
+
+# A worksheet holds at most 1,048,576 rows; one is reserved for the header.
+MAX_DATA_ROWS_PER_SHEET = 1_048_576 - 1
+
+# Leading characters that turn a cell into a formula in spreadsheet apps. 
Mirrors
+# superset.utils.excel.quote_formulas so streamed exports get the same guard.
+_FORMULA_PREFIXES = {"=", "+", "-", "@"}
+
+# Excel cannot represent integers beyond 10**15 without precision loss.
+_MAX_EXCEL_INT = 10**15
+
+
+def sanitize_sheet_name(raw: str, used: set[str]) -> str:
+    """
+    Produce a valid, unique Excel sheet name from ``raw``.
+
+    Replaces forbidden characters, strips surrounding apostrophes/whitespace,
+    avoids the reserved name "History", truncates to 31 characters, and
+    disambiguates case-insensitive collisions with ``~2``/``~3`` suffixes.
+    The chosen name (lower-cased) is added to ``used``.
+
+    :param raw: The desired sheet name (e.g. ``"42 - Sales by Region"``)
+    :param used: Lower-cased names already taken; mutated with the result
+    :returns: A sanitized, unique sheet name no longer than 31 characters
+    """
+    name = _INVALID_SHEET_CHARS_RE.sub("_", raw or "")
+    name = name.strip().strip("'").strip()
+    if not name:
+        name = "Sheet"
+    if name.lower() == _RESERVED_SHEET_NAME:
+        name = f"{name}_"
+    name = name[:MAX_SHEET_NAME_LEN]
+
+    if name.lower() not in used:
+        used.add(name.lower())
+        return name
+
+    suffix = 2
+    while True:
+        marker = f"~{suffix}"
+        candidate = name[: MAX_SHEET_NAME_LEN - len(marker)] + marker
+        if candidate.lower() not in used:
+            used.add(candidate.lower())
+            return candidate
+        suffix += 1
+
+
+def _sanitize_cell(value: Any) -> Any:
+    """
+    Coerce a single cell value into something safe for ``xlsxwriter``.
+
+    Quotes formula-like strings (defense against formula injection), 
stringifies
+    integers/floats Excel cannot represent precisely, renders temporal values 
as
+    ISO strings (timezones are not natively supported), and blanks out ``None``
+    and non-finite floats.
+    """
+    if value is None:
+        return ""
+    # bool is a subclass of int; preserve it before the numeric branches.
+    if isinstance(value, bool):
+        return value
+    if isinstance(value, str):
+        return f"'{value}" if value and value[0] in _FORMULA_PREFIXES else 
value
+    if isinstance(value, (datetime, date)):
+        return value.isoformat()
+    if isinstance(value, Decimal):
+        number = float(value)
+        return str(number) if abs(number) > _MAX_EXCEL_INT else number
+    if isinstance(value, numbers.Integral):
+        number = int(value)
+        return str(number) if abs(number) > _MAX_EXCEL_INT else number
+    if isinstance(value, numbers.Real):
+        number = float(value)
+        if not math.isfinite(number):
+            return ""
+        return str(number) if abs(number) > _MAX_EXCEL_INT else number

Review Comment:
   **Suggestion:** Large negative numerics are converted to strings to preserve 
precision, but those strings start with `-` and are not formula-escaped. Since 
`-` is in the formula-prefix denylist, this bypasses the intended 
formula-injection guard and can cause these cells to be interpreted as formulas 
instead of literal values; quote formula-like prefixes after numeric-to-string 
conversion too. [security]
   
   <details>
   <summary><b>Severity Level:</b> Major ⚠️</summary>
   
   ```mdx
   - ⚠️ Dashboard Excel export emits unquoted large negative numeric strings.
   - ⚠️ Formula-safety diverges from non-streaming exports in `utils.excel`.
   ```
   </details>
   <details>
   <summary><b>Steps of Reproduction ✅ </b></summary>
   
   ```mdx
   1. Trigger the dashboard Excel export endpoint `POST 
/api/v1/dashboard/<pk>/export_xlsx/`
   implemented in `superset/dashboards/api.py:1385-1435`, which enqueues the 
Celery task
   `export_dashboard_excel` (imported at `superset/dashboards/api.py:133`).
   
   2. The Celery worker executes `export_dashboard_excel` at
   `superset/tasks/export_dashboard_excel.py:64-120`, calls 
`_build_workbook(tmp_path,
   dashboard, active_data_mask, job_id)` at line 101, which iterates charts and 
for each one
   calls `_write_chart_sheets(writer, chart, dashboard.id, active_data_mask)` at
   `superset/tasks/export_dashboard_excel.py:20-31`.
   
   3. `_write_chart_sheets` at `superset/tasks/export_dashboard_excel.py:26-73` 
runs the
   chart query via `ChartDataCommand` and passes records to 
`StreamingXlsxWriter.add_sheet`
   at lines 69-73; for any column containing a very large negative integer or 
real (magnitude
   `> _MAX_EXCEL_INT`), `_sanitize_cell` in 
`superset/utils/excel_streaming.py:113-120`
   converts it to `str(number)` (for example `"-10000000000000000"`) and 
returns that string
   directly.
   
   4. Because `_sanitize_cell`'s formula-quoting branch for strings (lines 
106-107) only
   applies to original `str` inputs and is not re-run after numeric-to-string 
conversion,
   these large negative numerics are written by `worksheet.write_row` in 
`add_sheet` (line
   167) as unescaped strings beginning with `-`, diverging from the
   `superset.utils.excel.quote_formulas` behavior in 
`superset/utils/excel.py:45-59` and
   leaving formula-like prefixes unquoted in the streamed dashboard Excel 
export.
   ```
   </details>
   
   [![Fix in 
Cursor](https://new-codeant-butcket.s3.us-west-1.amazonaws.com/badges/fix-in-cursor-flat.svg)](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt_id=4ad6200fb16e4cc09328814629c6929e&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
 [![Fix in VSCode 
Claude](https://new-codeant-butcket.s3.us-west-1.amazonaws.com/badges/fix-in-vscode-claude-flat.svg)](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt_id=4ad6200fb16e4cc09328814629c6929e&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
   
   *(Use Cmd/Ctrl + Click for best experience)*
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** superset/utils/excel_streaming.py
   **Line:** 113:120
   **Comment:**
        *Security: Large negative numerics are converted to strings to preserve 
precision, but those strings start with `-` and are not formula-escaped. Since 
`-` is in the formula-prefix denylist, this bypasses the intended 
formula-injection guard and can cause these cells to be interpreted as formulas 
instead of literal values; quote formula-like prefixes after numeric-to-string 
conversion too.
   
   Validate the correctness of the flagged issue. If correct, How can I resolve 
this? If you propose a fix, implement it and please make it concise.
   Once fix is implemented, also check other comments on the same PR, and ask 
user if the user wants to fix the rest of the comments as well. if said yes, 
then fetch all the comments validate the correctness and implement a minimal fix
   ```
   </details>
   <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F41133&comment_hash=1fe5a7ae8b295cbad838bd782772789192fa69bf4883bc2b7c102a4632f33d64&reaction=like'>👍</a>
 | <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F41133&comment_hash=1fe5a7ae8b295cbad838bd782772789192fa69bf4883bc2b7c102a4632f33d64&reaction=dislike'>👎</a>



##########
superset/utils/excel_streaming.py:
##########
@@ -0,0 +1,190 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+"""
+Streaming, constant-memory XLSX writer for multi-sheet dashboard exports.
+
+Unlike :mod:`superset.utils.excel`, which materializes a whole DataFrame in
+memory, this writer streams rows one at a time into an ``xlsxwriter`` workbook
+opened in ``constant_memory`` mode, so a dashboard with many large charts never
+holds more than one row per sheet in memory at once.
+"""
+
+from __future__ import annotations
+
+import math
+import numbers
+import re
+from collections.abc import Iterable, Sequence
+from datetime import date, datetime
+from decimal import Decimal
+from typing import Any
+
+import xlsxwriter
+
+from superset.utils.excel import NEUTRAL_DOCUMENT_PROPERTIES
+
+# Excel limits a sheet name to 31 characters and forbids these characters.
+MAX_SHEET_NAME_LEN = 31
+_INVALID_SHEET_CHARS_RE = re.compile(r"[\[\]:*?/\\]")
+# Excel reserves the sheet name "History" (case-insensitive).
+_RESERVED_SHEET_NAME = "history"
+
+# A worksheet holds at most 1,048,576 rows; one is reserved for the header.
+MAX_DATA_ROWS_PER_SHEET = 1_048_576 - 1
+
+# Leading characters that turn a cell into a formula in spreadsheet apps. 
Mirrors
+# superset.utils.excel.quote_formulas so streamed exports get the same guard.
+_FORMULA_PREFIXES = {"=", "+", "-", "@"}
+
+# Excel cannot represent integers beyond 10**15 without precision loss.
+_MAX_EXCEL_INT = 10**15
+
+
+def sanitize_sheet_name(raw: str, used: set[str]) -> str:
+    """
+    Produce a valid, unique Excel sheet name from ``raw``.
+
+    Replaces forbidden characters, strips surrounding apostrophes/whitespace,
+    avoids the reserved name "History", truncates to 31 characters, and
+    disambiguates case-insensitive collisions with ``~2``/``~3`` suffixes.
+    The chosen name (lower-cased) is added to ``used``.
+
+    :param raw: The desired sheet name (e.g. ``"42 - Sales by Region"``)
+    :param used: Lower-cased names already taken; mutated with the result
+    :returns: A sanitized, unique sheet name no longer than 31 characters
+    """
+    name = _INVALID_SHEET_CHARS_RE.sub("_", raw or "")
+    name = name.strip().strip("'").strip()
+    if not name:
+        name = "Sheet"
+    if name.lower() == _RESERVED_SHEET_NAME:
+        name = f"{name}_"
+    name = name[:MAX_SHEET_NAME_LEN]
+
+    if name.lower() not in used:
+        used.add(name.lower())
+        return name
+
+    suffix = 2
+    while True:
+        marker = f"~{suffix}"
+        candidate = name[: MAX_SHEET_NAME_LEN - len(marker)] + marker
+        if candidate.lower() not in used:
+            used.add(candidate.lower())
+            return candidate
+        suffix += 1
+
+
+def _sanitize_cell(value: Any) -> Any:
+    """
+    Coerce a single cell value into something safe for ``xlsxwriter``.
+
+    Quotes formula-like strings (defense against formula injection), 
stringifies
+    integers/floats Excel cannot represent precisely, renders temporal values 
as
+    ISO strings (timezones are not natively supported), and blanks out ``None``
+    and non-finite floats.
+    """
+    if value is None:
+        return ""
+    # bool is a subclass of int; preserve it before the numeric branches.
+    if isinstance(value, bool):
+        return value
+    if isinstance(value, str):
+        return f"'{value}" if value and value[0] in _FORMULA_PREFIXES else 
value
+    if isinstance(value, (datetime, date)):
+        return value.isoformat()
+    if isinstance(value, Decimal):
+        number = float(value)
+        return str(number) if abs(number) > _MAX_EXCEL_INT else number

Review Comment:
   **Suggestion:** Decimal handling can crash at runtime because it 
unconditionally does `float(value)` and then returns that float directly. For 
values like `Decimal("NaN")` this returns `nan` (which xlsxwriter rejects), and 
for very large decimals `float(value)` can raise `OverflowError`. Mirror the 
non-finite/overflow handling used for `numbers.Real` (or keep problematic 
decimals as strings) before writing. [type error]
   
   <details>
   <summary><b>Severity Level:</b> Major ⚠️</summary>
   
   ```mdx
   - ❌ Dashboard Excel export skips charts with NaN Decimal values.
   - ⚠️ Success email shows skipped charts instead of exported data.
   ```
   </details>
   <details>
   <summary><b>Steps of Reproduction ✅ </b></summary>
   
   ```mdx
   1. Trigger the dashboard Excel export endpoint `POST 
/api/v1/dashboard/<pk>/export_xlsx/`
   implemented in `superset/dashboards/api.py:1385-1435`, which enqueues the 
Celery task
   `export_dashboard_excel` (imported at `superset/dashboards/api.py:133`).
   
   2. The Celery worker runs `export_dashboard_excel` in
   `superset/tasks/export_dashboard_excel.py:64-120`, which constructs a 
temporary `.xlsx`
   path and calls `_build_workbook(tmp_path, dashboard, active_data_mask, 
job_id)` at
   `superset/tasks/export_dashboard_excel.py:101`.
   
   3. `_build_workbook` in `superset/tasks/export_dashboard_excel.py:10-39` 
iterates charts
   and calls `_write_chart_sheets(writer, chart, dashboard.id, 
active_data_mask)` at line 25;
   `_write_chart_sheets` at `superset/tasks/export_dashboard_excel.py:26-73` 
uses
   `ChartDataCommand` to execute the chart's query and then calls 
`writer.add_sheet(name,
   colnames, (_record_to_row(record, colnames) for record in data))` at lines 
69-73.
   
   4. `StreamingXlsxWriter.add_sheet` in 
`superset/utils/excel_streaming.py:144-171` writes
   each row with `worksheet.write_row(..., [_sanitize_cell(cell) for cell in 
row])` at line
   167; when any cell value is a `Decimal` such as `Decimal("NaN")` or an 
extremely large
   `Decimal("1E10000")`, `_sanitize_cell` at lines 110-112 executes `number = 
float(value)`
   which returns `nan` or raises `OverflowError`, causing 
`xlsxwriter.Workbook.write_row` to
   fail, the exception to propagate to `_write_chart_sheets` and be caught by 
the broad
   `except Exception` in `_build_workbook` (lines 24-30), resulting in that 
chart being
   skipped (no sheet written) and listed as skipped in the export success email.
   ```
   </details>
   
   [![Fix in 
Cursor](https://new-codeant-butcket.s3.us-west-1.amazonaws.com/badges/fix-in-cursor-flat.svg)](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt_id=bdb0b4c9de274dad918edc5f29627645&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
 [![Fix in VSCode 
Claude](https://new-codeant-butcket.s3.us-west-1.amazonaws.com/badges/fix-in-vscode-claude-flat.svg)](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt_id=bdb0b4c9de274dad918edc5f29627645&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
   
   *(Use Cmd/Ctrl + Click for best experience)*
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** superset/utils/excel_streaming.py
   **Line:** 110:112
   **Comment:**
        *Type Error: Decimal handling can crash at runtime because it 
unconditionally does `float(value)` and then returns that float directly. For 
values like `Decimal("NaN")` this returns `nan` (which xlsxwriter rejects), and 
for very large decimals `float(value)` can raise `OverflowError`. Mirror the 
non-finite/overflow handling used for `numbers.Real` (or keep problematic 
decimals as strings) before writing.
   
   Validate the correctness of the flagged issue. If correct, How can I resolve 
this? If you propose a fix, implement it and please make it concise.
   Once fix is implemented, also check other comments on the same PR, and ask 
user if the user wants to fix the rest of the comments as well. if said yes, 
then fetch all the comments validate the correctness and implement a minimal fix
   ```
   </details>
   <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F41133&comment_hash=43c41b30b322d21aad12172a58aa3baac412f1adb35eaf411a22494d4aa85d4c&reaction=like'>👍</a>
 | <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F41133&comment_hash=43c41b30b322d21aad12172a58aa3baac412f1adb35eaf411a22494d4aa85d4c&reaction=dislike'>👎</a>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to