Copilot commented on code in PR #37245:
URL: https://github.com/apache/superset/pull/37245#discussion_r2713911223


##########
superset/commands/report/execute.py:
##########
@@ -436,6 +436,20 @@ def _get_pdf(self) -> bytes:
 
         return pdf
 
+    def _ensure_utf8_bom(self, csv_data: bytes, encoding: str) -> bytes:
+        """
+        Ensure CSV bytes contain UTF-8 BOM when encoding is utf-8-sig.
+        Avoid double BOM.
+        """
+        if not csv_data:
+            return csv_data
+
+        enc = (encoding or "").lower().replace("_", "-")
+        if enc in ("utf-8-sig", "utf8-sig"):

Review Comment:
   The encoding normalization at line 447 replaces underscores with hyphens, 
converting both "utf_8_sig" and "utf-8-sig" to "utf-8-sig". Therefore, checking 
for "utf8-sig" (without any separator) in line 448 is unnecessary after this 
normalization.
   
   Consider simplifying to just check for "utf-8-sig":
   ```python
   enc = (encoding or "").lower().replace("_", "-")
   if enc == "utf-8-sig":
   ```
   
   Alternatively, if you want to handle "utf8sig" (no separator), you could 
check for both "utf-8-sig" and "utf8sig", but the current check for "utf8-sig" 
(with hyphen but no separators elsewhere) won't match any real input after 
normalization.
   ```suggestion
           if enc == "utf-8-sig":
   ```



##########
superset/commands/report/execute.py:
##########
@@ -436,6 +436,20 @@ def _get_pdf(self) -> bytes:
 
         return pdf
 
+    def _ensure_utf8_bom(self, csv_data: bytes, encoding: str) -> bytes:
+        """
+        Ensure CSV bytes contain UTF-8 BOM when encoding is utf-8-sig.
+        Avoid double BOM.
+        """
+        if not csv_data:
+            return csv_data
+
+        enc = (encoding or "").lower().replace("_", "-")
+        if enc in ("utf-8-sig", "utf8-sig"):
+            if not csv_data.startswith(b"\xef\xbb\xbf"):
+                return b"\xef\xbb\xbf" + csv_data
+        return csv_data

Review Comment:
   The new method _ensure_utf8_bom lacks test coverage. Given that this 
addresses a specific encoding issue that could lead to data corruption (garbled 
characters in Excel), it would be valuable to add a test that verifies:
   
   1. When CSV_EXPORT.encoding is "utf-8-sig", the returned CSV bytes start 
with the UTF-8 BOM (b"\xef\xbb\xbf")
   2. When a BOM already exists, it's not duplicated
   3. When the encoding is not utf-8-sig, no BOM is added
   
   Consider adding a test in tests/integration_tests/reports/commands_tests.py 
similar to test_email_chart_report_schedule_with_csv but with assertions on the 
BOM presence.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to