aminghadersohi commented on code in PR #41533:
URL: https://github.com/apache/superset/pull/41533#discussion_r3500757727


##########
superset/result_set.py:
##########
@@ -60,7 +60,11 @@ def dedup(l: list[str], suffix: str = "__", case_sensitive: bool = True) -> list
 
 
 def stringify(obj: Any) -> str:
-    return json.dumps(obj, default=json.json_iso_dttm_ser)
+    # ensure_ascii=False so non-ASCII characters in array/struct/JSON column
+    # values (e.g. Cyrillic or CJK text from array_agg) render verbatim in the
+    # result grid instead of as \uXXXX escape sequences. This only affects the
+    # query result payload, not metadata persisted to the database.
+    return json.dumps(obj, default=json.json_iso_dttm_ser, ensure_ascii=False)

Review Comment:
   MEDIUM: With `ensure_ascii=True`, `simplejson` escapes lone surrogates 
(U+D800–U+DFFF) as `\ud800`-style sequences and returns successfully. With 
`ensure_ascii=False`, the C extension calls `PyUnicode_AsUTF8String` on the raw 
code points, which raises `UnicodeEncodeError` for lone surrogates. Such 
surrogates can appear in data from CESU-8-encoded MySQL/SQL Server clients. The 
inner `except TypeError` at line 87 of `stringify_values()` does not catch 
`UnicodeEncodeError`, so a dict or list column containing a lone-surrogate 
string would now raise instead of falling back to `str(val)`. Suggest widening 
that catch:
   ```python
   except (TypeError, UnicodeEncodeError):
       obj[...] = str(val)
   ```
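
   To make the fallback path concrete, here is a minimal, self-contained sketch. 
It is not the Superset code: `stringify_or_fallback` is a hypothetical helper, 
and the standard-library `json` stands in for `simplejson`. Stdlib `json` passes 
lone surrogates through unchanged when `ensure_ascii=False`, so the sketch forces 
a UTF-8 encode to reproduce the `UnicodeEncodeError` described above.

```python
import json
from typing import Any


def stringify(obj: Any) -> str:
    # Stand-in for the patched stringify(); stdlib json replaces simplejson
    # here (assumption made for illustration only).
    return json.dumps(obj, ensure_ascii=False)


def stringify_or_fallback(val: Any) -> str:
    """Hypothetical helper showing the widened except clause."""
    try:
        text = stringify(val)
        # Stdlib json leaves a lone surrogate in the returned str, so encode
        # explicitly to surface the UnicodeEncodeError at this point.
        text.encode("utf-8")
        return text
    except (TypeError, UnicodeEncodeError):
        # Same fallback as the suggestion above: plain str() of the value.
        return str(val)


print(stringify_or_fallback(["Лонгсливы", "Свитшоты"]))  # rendered verbatim
print(stringify_or_fallback({"k": "\ud800"}))  # lone surrogate: falls back
print(stringify_or_fallback({1}))  # not JSON-serializable: falls back
```

   With only `except TypeError`, the second call would propagate the 
`UnicodeEncodeError` instead of falling back.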



##########
tests/unit_tests/result_set_test.py:
##########
@@ -522,6 +522,37 @@ def test_clickhouse_json_column_in_pa_table_is_valid_json() -> None:
     assert parsed1 == {"e": 5}
 
 
+def test_stringify_values_preserves_non_ascii_characters() -> None:
+    """
+    Non-ASCII text inside array/struct/JSON column values (e.g. the Cyrillic and
+    CJK strings produced by ``array_agg``) must render verbatim in the result
+    grid, not as ``\\uXXXX`` escape sequences. Regression test for #19388 and
+    #22904, where such values were displayed as "unicode gibberish".
+    """
+    data = np.array(
+        [
+            ["Лонгсливы", "Свитшоты"],

Review Comment:
   NIT: the test data covers Cyrillic (U+0400–U+04FF) and CJK (U+4E00–U+9FFF), 
both in the BMP. Adding a supplementary-plane entry, e.g. `["🏆", "🌍"]`, and 
asserting `"\\u" not in result[3]` would also pin simplejson's behavior for code 
points outside the BMP, which take 4 bytes in UTF-8.
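
   A rough sketch of what such a check could look like, again using stdlib 
`json` in place of `simplejson` so it runs standalone. The `render` helper and 
the test name are hypothetical; the real test would go through 
`stringify_values` and index into its result.

```python
import json


def render(values: list[str]) -> str:
    # Hypothetical stand-in for the patched stringify().
    return json.dumps(values, ensure_ascii=False)


def test_supplementary_plane_characters_render_verbatim() -> None:
    emoji = ["🏆", "🌍"]  # U+1F3C6 and U+1F30D, both outside the BMP
    result = render(emoji)
    # No \uXXXX escapes, and the payload round-trips unchanged.
    assert "\\u" not in result
    assert json.loads(result) == emoji
    # For contrast: the default ensure_ascii=True emits surrogate-pair escapes.
    assert "\\ud83c" in json.dumps(emoji)


test_supplementary_plane_characters_render_verbatim()
```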



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

