andygrove opened a new issue, #4463:
URL: https://github.com/apache/datafusion-comet/issues/4463

   ## Describe the bug
   
   `translate` is wired as `CometScalarFunction("translate")` and currently 
reports `Compatible`, but DataFusion's `translate` diverges from Spark in two 
ways:
   
   1. **Grapheme vs code-point semantics.** DataFusion iterates over Unicode 
graphemes; Spark uses code points (via `Character.charCount`). When every 
grapheme is a single code point (BMP or supplementary) these match, but for 
multi-code-point graphemes (combining marks, ZWJ sequences, flag emoji) the 
two implementations disagree.
   2. **NUL byte in the `to` argument.** Spark's `StringTranslate.buildDict` 
treats any character mapped to U+0000 in `to` as a deletion. DataFusion 
substitutes U+0000 instead.
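
   The two divergences can be modeled with a minimal sketch. This is not 
the Spark or DataFusion code: `simple_graphemes`, `translate_units`, 
`spark_like`, and `datafusion_like` are hypothetical names, the grapheme split 
is a rough stand-in for real UAX #29 segmentation (it only attaches combining 
marks to the preceding character), and the "first occurrence in `from` wins" 
rule is an assumption.

```python
import unicodedata


def simple_graphemes(s):
    # Rough approximation: a base character plus any following combining marks.
    # Real grapheme segmentation also handles ZWJ sequences, flags, CRLF, etc.
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def translate_units(units, from_units, to_units, nul_deletes):
    # Build the mapping; a `from` unit with no counterpart in `to` means delete.
    mapping = {}
    for i, f in enumerate(from_units):
        if f in mapping:
            continue  # assumption: first occurrence wins
        mapping[f] = to_units[i] if i < len(to_units) else None
    out = []
    for u in units:
        if u not in mapping:
            out.append(u)
            continue
        rep = mapping[u]
        if rep is None or (nul_deletes and rep == "\0"):
            continue  # deletion
        out.append(rep)
    return "".join(out)


def spark_like(s, frm, to):
    # Code-point units; U+0000 in `to` is treated as a deletion marker.
    return translate_units(list(s), list(frm), list(to), nul_deletes=True)


def datafusion_like(s, frm, to):
    # Grapheme units; U+0000 in `to` is substituted like any other character.
    return translate_units(
        simple_graphemes(s), simple_graphemes(frm), simple_graphemes(to),
        nul_deletes=False,
    )
```

   Under this model, `spark_like("e\u0301", "e", "a")` rewrites the base 
letter and keeps the combining mark, while `datafusion_like` leaves the input 
unchanged because the grapheme `e` + U+0301 does not match `e`.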
   
   Surfaced by the string-expressions audit in apache/datafusion-comet#4461.
   
   ## Steps to reproduce
   
   ```sql
   -- (1) grapheme vs code point: combining mark
   SELECT translate(concat('e', '\u0301'), 'e', 'a');
   
   -- (2) U+0000 deletion: expected to delete 'b' under Spark
   SELECT translate('abc', 'b', char(0));
   ```
   
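   In the first query, Spark translates per code point, so it replaces `e` 
and keeps the combining mark U+0301; Comet treats `e` + U+0301 as a single 
grapheme that does not match `e`, so the input comes back unchanged. In the 
second query, Spark deletes the matched character; Comet substitutes a NUL 
character.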
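
   Because U+0000 and combining marks are hard to see in console output, it 
can help to compare results by code point. A small sketch, where 
`code_points` is a hypothetical helper and not part of Comet or Spark:

```python
def code_points(s):
    # Render each code point as U+XXXX so invisible characters show up.
    return [f"U+{ord(ch):04X}" for ch in s]


# A result with the character deleted vs. one with a NUL substituted.
deleted = code_points("ac")
substituted = code_points("a\0c")
```

   Applied to a collected result string, this makes a substituted NUL 
visible as an explicit `U+0000` entry instead of an apparently identical 
string.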
   
   ## Expected behavior
   
   Match Spark behavior, or downgrade `translate` to `Incompatible(Some(...))` 
so the non-ASCII path falls back unless explicitly enabled.
   
   ## Additional context
   
   - Comet wiring: `QueryPlanSerde.scala` -> `classOf[StringTranslate] -> 
CometScalarFunction("translate")`
   - Spark reference: `UTF8String.translate(dict)` with 
`StringTranslate.buildDict`
   - DataFusion impl: `datafusion-functions::unicode::translate`
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

