andygrove commented on code in PR #4461:
URL: https://github.com/apache/datafusion-comet/pull/4461#discussion_r3319433854


##########
docs/source/contributor-guide/spark_expressions_support.md:
##########
@@ -523,40 +523,109 @@
 ### string_funcs
 
 - [x] ascii
+  - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+  - Spark 3.5.8 (audited 2026-05-27): baseline. `StringType -> IntegerType`; 
`nullSafeEval` returns `codePointAt(0)` of the first char, or `0` for the empty 
string. Wired via `CometScalarFunction("ascii")` and resolved to DataFusion 
`ascii` (`chars().next() as i32`); first-code-point semantics match for ASCII, 
BMP, and supplementary code points.
+  - Spark 4.0.1 (audited 2026-05-27): `inputTypes` widened to 
`StringTypeWithCollation(supportsTrimCollation = true)`; behaviour unchanged 
for `UTF8_BINARY`. Comet does not propagate collation, so non-default 
collations may diverge silently.
 - [ ] base64
 - [x] bit_length
+  - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+  - Spark 3.5.8 (audited 2026-05-27): baseline. `(StringType|BinaryType) -> 
IntegerType`; eval returns `numBytes * 8` for strings and `.length * 8` for 
binary.
+  - Spark 4.0.1 (audited 2026-05-27): `inputTypes` widened to 
`StringTypeWithCollation(supportsTrimCollation = true)`; semantics unchanged.
+  - Known limitation: wired as a raw `CometScalarFunction("bit_length")` with 
no `BinaryType` guard. DataFusion's `BitLengthFunc` signature only accepts 
string types, so `bit_length(<binary>)` execute-fails on the native side 
instead of falling back cleanly 
(https://github.com/apache/datafusion-comet/issues/4464).
 - [x] btrim
+  - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+  - Spark 3.5.8 (audited 2026-05-27): baseline. `StringTrimBoth` is 
`RuntimeReplaceable` and rewritten to `StringTrim(srcStr, trimStr)` before 
serde runs, so the explicit `CometScalarFunction("btrim")` mapping is 
unreachable.
+  - Spark 4.0.1 (audited 2026-05-27): `StringTrim` (the rewrite target) routes 
through `CollationSupport.StringTrim.exec` and uses 
`StringTypeNonCSAICollation(supportsTrimCollation = true)`; semantics unchanged 
for `UTF8_BINARY`. Non-default collations may diverge in Comet.
 - [x] char
+  - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+  - Spark 3.5.8 (audited 2026-05-27): baseline. `Chr(LongType) -> StringType`; 
`lon < 0` returns `""`, else `((lon & 0xFF) as char).toString` (so `chr(256)` 
and `chr(0)` both return `\u0000`).
+  - Spark 4.0.1 (audited 2026-05-27): semantics unchanged; `NullIntolerant` 
trait replaced by `override def nullIntolerant: Boolean = true`. Resolves 
natively to `datafusion_spark::function::string::char::CharFunc`, which mirrors 
Spark's negative-input and `& 0xFF` semantics.
 - [x] char_length
+  - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+  - Spark 3.5.8 (audited 2026-05-27): registry alias of `Length`. Same support 
as `length`.
+  - Spark 4.0.1 (audited 2026-05-27): unchanged alias of `Length`.
 - [x] character_length
+  - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+  - Spark 3.5.8 (audited 2026-05-27): registry alias of `Length`. Same support 
as `length`.
+  - Spark 4.0.1 (audited 2026-05-27): unchanged alias of `Length`.
 - [x] chr
+  - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+  - Spark 3.5.8 (audited 2026-05-27): registry alias of `Chr`. Same support as 
`char`.
+  - Spark 4.0.1 (audited 2026-05-27): unchanged alias of `Chr`.
 - [ ] collate
 - [ ] collation
 - [x] concat_ws
+  - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+  - Spark 3.5.8 (audited 2026-05-27): baseline. `Seq[Expression] -> 
StringType`; NULL separator yields NULL, NULL element values are skipped, 
children can be `StringType` or `ArrayType(StringType)`. Comet serde rewrites a 
NULL-literal separator to a NULL of the result type and bails out on 
all-foldable inputs so Spark's `ConstantFolding` handles them; otherwise 
delegates to DataFusion `concat_ws`.
+  - Spark 4.0.1 (audited 2026-05-27): `inputTypes` widened to 
`StringTypeWithCollation` / `AbstractArrayType`; `dataType` becomes 
`children.head.dataType` (collation-derived). Semantics unchanged for 
`UTF8_BINARY`.
 - [x] contains
+  - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+  - Spark 3.5.8 (audited 2026-05-27): baseline. `UTF8String.contains` on 
`StringType`; the parser routes `(BinaryType, BinaryType)` to 
`BinaryPredicate`, so Comet only ever sees the String form.
+  - Spark 4.0.1 (audited 2026-05-27): routes through 
`CollationSupport.Contains.exec(..., collationId)`; behaviour identical for 
`UTF8_BINARY`. Non-default collations not honoured by Comet.
 - [x] decode
+  - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+  - Spark 3.5.8 (audited 2026-05-27): baseline. `StringDecode(bin, charset)` 
evaluated directly; invalid sequences silently substitute replacement 
characters via `new String(bytes, charset)`.
+  - Spark 4.0.1 (audited 2026-05-27): refactored to `RuntimeReplaceable` whose 
`replacement` is a `StaticInvoke(StringDecode.decode, bin, charset, 
legacyCharsets, legacyErrorAction)`; the 4-arg form raises on malformed input 
unless legacy flags are set.
+  - Known limitations: Comet handles `decode` via 
`CommonStringExprs.stringDecode` from the version shims (no 
`CometExpressionSerde[StringDecode]` registration, so the function does not 
surface in the auto-generated compatibility docs: 
https://github.com/apache/datafusion-comet/issues/4466). Only literal `charset 
= 'utf-8'` (case-insensitive) is supported; everything else falls back. The 
Spark 4.0 `legacyCharsets` / `legacyErrorAction` flags are ignored: Comet 
always lowers to `Cast(bin, StringType, TRY)`, so invalid UTF-8 yields NULL 
where Spark 3.x substitutes replacement characters and Spark 4.0 (non-legacy) 
raises (https://github.com/apache/datafusion-comet/issues/4465).
 - [ ] elt
 - [ ] encode
 - [x] endswith
+  - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+  - Spark 3.5.8 (audited 2026-05-27): baseline. `UTF8String.endsWith` on 
`StringType`; binary form routed to `BinaryPredicate` before Comet.
+  - Spark 4.0.1 (audited 2026-05-27): routes through 
`CollationSupport.EndsWith.exec`; semantics unchanged for `UTF8_BINARY`.
 - [ ] find_in_set
 - [ ] format_number
 - [ ] format_string
 - [x] initcap
+  - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+  - Spark 3.5.8 (audited 2026-05-27): baseline. 
`string.toLowerCase.toTitleCase` on `UTF8String`; word boundary is Java 
`Character.isWhitespace`. Comet routes to DataFusion `initcap`, which splits on 
`!is_alphanumeric()` (hyphens, apostrophes, and punctuation all split words), 
so Comet is unconditionally `Incompatible` 
(https://github.com/apache/datafusion-comet/issues/1052).
+  - Spark 4.0.1 (audited 2026-05-27): routes through 
`CollationSupport.InitCap.exec` (collation- and ICU-aware) and propagates 
`child.dataType`. Comet ignores collation; 3.x divergences persist plus 
collation/ICU mismatches.
 - [x] instr
+  - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+  - Spark 3.5.8 (audited 2026-05-27): baseline. `StringInstr(str, substr) -> 
IntegerType`; returns `string.indexOf(sub, 0) + 1` (1-based, 0 when not found, 
1 on empty substring). Resolves to DataFusion `strpos` (alias `instr`) with 
matching semantics.
+  - Spark 4.0.1 (audited 2026-05-27): routes through 
`CollationSupport.StringInstr.exec`; semantics unchanged for `UTF8_BINARY`.
 - [ ] is_valid_utf8
 - [x] lcase
+  - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+  - Spark 3.5.8 (audited 2026-05-27): registry alias of `Lower`. Same support 
as `lower`.
+  - Spark 4.0.1 (audited 2026-05-27): unchanged alias of `Lower`.
 - [x] left
+  - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+  - Spark 3.5.8 (audited 2026-05-27): baseline. `RuntimeReplaceable` with 
`replacement = Substring(str, Literal(1), len)`; accepts `StringType` or 
`BinaryType` plus `IntegerType`. Comet serde rewrites to a `Substring` proto 
with `start=1, len=lenValue`, requires `len` to be a `Literal`, and falls back 
otherwise.
+  - Spark 4.0.1 (audited 2026-05-27): `inputTypes` widened with 
`StringTypeWithCollation`; behaviour unchanged for `UTF8_BINARY`.
+  - Known limitation: the literal-only `len` restriction is enforced inside 
`convert` via `withInfo` rather than declared in `getSupportLevel`, so EXPLAIN 
surfaces it only at conversion time.

Review Comment:
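   We should fix this: declare the literal-only `len` restriction in `getSupportLevel` instead of enforcing it inside `convert`, so that EXPLAIN reports it before conversion time.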
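
One possible shape of such a fix for the `left` limitation quoted at the end of the hunk, sketched in Python as a language-neutral model rather than in Comet's Scala serde API. Every name here (`Left`, `Literal`, `ColumnRef`, `get_support_level`, `convert`) is a hypothetical stand-in, and the assumption is only that the support check runs at planning time, before conversion:

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Literal:
    # Hypothetical stand-in for a constant expression.
    value: int


@dataclass(frozen=True)
class ColumnRef:
    # Hypothetical stand-in for a non-constant (per-row) expression.
    name: str


@dataclass(frozen=True)
class Left:
    # Hypothetical stand-in for Spark's left(str, len).
    source: ColumnRef
    length: Union[Literal, ColumnRef]


def get_support_level(expr: Left) -> Optional[str]:
    """Return None when supported, else the fallback reason.

    Declaring the restriction here makes it visible before conversion runs.
    """
    if not isinstance(expr.length, Literal):
        return "left: `len` must be a literal"
    return None


def convert(expr: Left) -> dict:
    """Lower left(str, len) to a substring-like node with start=1."""
    reason = get_support_level(expr)
    if reason is not None:
        # Conversion no longer discovers the restriction; it only enforces it.
        raise ValueError(reason)
    return {
        "op": "substring",
        "child": expr.source.name,
        "start": 1,
        "len": expr.length.value,
    }
```

With this split, a planner can call `get_support_level` to report the fallback reason up front, and `convert` is reached only for the literal-`len` case.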


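For context on the other open items in the hunk, the `decode` divergence described there (replacement characters in Spark 3.x, an error in non-legacy Spark 4.0, NULL from Comet's TRY cast) can be modelled roughly as below. This is a Python approximation of the three behaviours, not the Spark or Comet code; the function names are hypothetical, and Python's `errors="replace"` is assumed to be close enough to Java's `new String(bytes, charset)` for illustration:

```python
from typing import Optional


def decode_spark3(data: bytes) -> str:
    # Models Spark 3.x: malformed input is silently replaced with U+FFFD.
    return data.decode("utf-8", errors="replace")


def decode_spark4_strict(data: bytes) -> str:
    # Models Spark 4.0 without legacy flags: malformed input raises.
    return data.decode("utf-8", errors="strict")


def decode_comet_try_cast(data: bytes) -> Optional[str]:
    # Models a TRY-mode cast: malformed input yields NULL (None here).
    try:
        return data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None
```

For valid UTF-8 all three agree; they differ only on malformed bytes such as `b"ab\xff"`.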

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

