andygrove opened a new issue, #4465:
URL: https://github.com/apache/datafusion-comet/issues/4465

   ## Describe the bug
   
   Spark 4.0 refactored `StringDecode` from a `BinaryExpression` to a
`RuntimeReplaceable` whose `replacement` is
`StaticInvoke(StringDecode.decode, bin, charset, legacyCharsets, legacyErrorAction)`.
Of the two new boolean arguments, `legacyErrorAction` controls
malformed-character handling: with `legacyErrorAction = true`, Spark
substitutes replacement characters for invalid UTF-8 sequences (matching the
Spark 3.x behavior); with `legacyErrorAction = false` (the default), Spark
raises `QueryExecutionErrors.malformedCharacterCoding(...)`.
   
   Comet's Spark 4.0 shim
(`spark/src/main/spark-4.0/org/apache/comet/shims/CometExprShim.scala`)
destructures the `StaticInvoke` arguments and discards both flags, then routes
through `CommonStringExprs.stringDecode`, which always lowers to
`Cast(bin, StringType, TRY)`. The `Cast` TRY path produces NULL on invalid
UTF-8 in all cases. That means:
   
   - Under Spark 4.0 default mode (`legacyErrorAction = false`): Spark raises, 
Comet returns NULL.
   - Under Spark 4.0 legacy mode (`legacyErrorAction = true`): Spark 
substitutes replacement characters, Comet returns NULL.
   - Under Spark 3.x: Spark substitutes replacement characters, Comet returns 
NULL.
   
   Surfaced by the string-expressions audit in apache/datafusion-comet#4461.
   
   ## Steps to reproduce
   
   ```sql
   SET spark.sql.legacy.javaCharsets = true;
   SELECT decode(X'FF', 'UTF-8');
   ```
   
   Spark 3.x: returns `?` (Unicode replacement).
   Spark 4.0 (legacy mode): same as 3.x.
   Spark 4.0 (default mode): raises `MALFORMED_CHARACTER_CODING`.
   Comet: returns NULL in all three cases.
   
   ## Expected behavior
   
   Honor `legacyCharsets` / `legacyErrorAction` when running under Spark 4.0+.
At minimum, the flags should be propagated through the proto so the native
implementation can choose between the substitute/throw/null modes.
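
As a rough illustration of the substitute/throw/null split, here is a minimal Rust sketch using only the standard library. `DecodeMode`, `mode_from_flag`, and `decode_utf8` are hypothetical names invented for this example, not existing Comet APIs, and the error is a plain string standing in for whatever error type the native side would actually use. Whether `String::from_utf8_lossy` matches the JVM's substitution on every malformed input (beyond the single-byte `X'FF'` case above) is an assumption that would need checking.

```rust
/// Hypothetical decode modes; names are illustrative, not Comet's API.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DecodeMode {
    /// Replace malformed sequences with U+FFFD (Spark 3.x, Spark 4.0 legacy mode).
    Substitute,
    /// Fail on malformed input (Spark 4.0 default mode).
    Raise,
    /// Return NULL on malformed input (what the Cast TRY path does today).
    Null,
}

/// Map the propagated `legacyErrorAction` flag onto a mode.
fn mode_from_flag(legacy_error_action: bool) -> DecodeMode {
    if legacy_error_action {
        DecodeMode::Substitute
    } else {
        DecodeMode::Raise
    }
}

/// Decode UTF-8 bytes. `Ok(None)` stands in for SQL NULL, and the `Err`
/// string stands in for a real error type.
fn decode_utf8(bytes: &[u8], mode: DecodeMode) -> Result<Option<String>, String> {
    match std::str::from_utf8(bytes) {
        // Valid input decodes the same way in every mode.
        Ok(s) => Ok(Some(s.to_owned())),
        Err(_) => match mode {
            DecodeMode::Substitute => {
                Ok(Some(String::from_utf8_lossy(bytes).into_owned()))
            }
            DecodeMode::Raise => Err("MALFORMED_CHARACTER_CODING".to_string()),
            DecodeMode::Null => Ok(None),
        },
    }
}
```

With this shape, `decode_utf8(&[0xFF], mode_from_flag(true))` yields the replacement character, `mode_from_flag(false)` yields an error, and `DecodeMode::Null` reproduces the current behavior.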
   
   ## Additional context
   
   - Shim location: 
`spark/src/main/spark-4.0/org/apache/comet/shims/CometExprShim.scala` (and 
`spark-4.1`, `spark-4.2`)
   - Helper: `CommonStringExprs.stringDecode` in 
`spark/src/main/scala/org/apache/comet/serde/strings.scala`
   - Related: #4465 (decode not surfaced in compatibility docs)
   

