Copilot commented on code in PR #37245:
URL: https://github.com/apache/superset/pull/37245#discussion_r2713911223
##########
superset/commands/report/execute.py:
##########
@@ -436,6 +436,20 @@ def _get_pdf(self) -> bytes:
return pdf
+ def _ensure_utf8_bom(self, csv_data: bytes, encoding: str) -> bytes:
+ """
+ Ensure CSV bytes contain UTF-8 BOM when encoding is utf-8-sig.
+ Avoid double BOM.
+ """
+ if not csv_data:
+ return csv_data
+
+ enc = (encoding or "").lower().replace("_", "-")
+ if enc in ("utf-8-sig", "utf8-sig"):
Review Comment:
The encoding normalization at line 447 replaces underscores with hyphens,
converting both "utf_8_sig" and "utf-8-sig" to "utf-8-sig". Therefore, checking
for "utf8-sig" (without any separator) in line 448 is unnecessary after this
normalization.
Consider simplifying to just check for "utf-8-sig":
```python
enc = (encoding or "").lower().replace("_", "-")
if enc == "utf-8-sig":
```
Alternatively, if you want to handle "utf8sig" (no separator), you could
check for both "utf-8-sig" and "utf8sig", but the current check for "utf8-sig"
(with hyphen but no separators elsewhere) won't match any real input after
normalization.
```suggestion
if enc == "utf-8-sig":
```
##########
superset/commands/report/execute.py:
##########
@@ -436,6 +436,20 @@ def _get_pdf(self) -> bytes:
return pdf
+ def _ensure_utf8_bom(self, csv_data: bytes, encoding: str) -> bytes:
+ """
+ Ensure CSV bytes contain UTF-8 BOM when encoding is utf-8-sig.
+ Avoid double BOM.
+ """
+ if not csv_data:
+ return csv_data
+
+ enc = (encoding or "").lower().replace("_", "-")
+ if enc in ("utf-8-sig", "utf8-sig"):
+ if not csv_data.startswith(b"\xef\xbb\xbf"):
+ return b"\xef\xbb\xbf" + csv_data
+ return csv_data
Review Comment:
The new method _ensure_utf8_bom lacks test coverage. Given that this
addresses a specific encoding issue that could lead to data corruption (garbled
characters in Excel), it would be valuable to add a test that verifies:
1. When CSV_EXPORT.encoding is "utf-8-sig", the returned CSV bytes start
with the UTF-8 BOM (b"\xef\xbb\xbf")
2. When a BOM already exists, it's not duplicated
3. When the encoding is not utf-8-sig, no BOM is added
Consider adding a test in tests/integration_tests/reports/commands_tests.py
similar to test_email_chart_report_schedule_with_csv but with assertions on the
BOM presence.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]