andygrove commented on code in PR #4461:
URL: https://github.com/apache/datafusion-comet/pull/4461#discussion_r3319433854
##########
docs/source/contributor-guide/spark_expressions_support.md:
##########
@@ -523,40 +523,109 @@
### string_funcs
- [x] ascii
+ - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+ - Spark 3.5.8 (audited 2026-05-27): baseline. `StringType -> IntegerType`;
`nullSafeEval` returns `codePointAt(0)` of the first char, or `0` for the empty
string. Wired via `CometScalarFunction("ascii")` and resolved to DataFusion
`ascii` (`chars().next() as i32`); first-code-point semantics match for ASCII,
BMP, and supplementary code points.
+ - Spark 4.0.1 (audited 2026-05-27): `inputTypes` widened to
`StringTypeWithCollation(supportsTrimCollation = true)`; behaviour unchanged
for `UTF8_BINARY`. Comet does not propagate collation, so non-default
collations may diverge silently.
- [ ] base64
- [x] bit_length
+ - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+ - Spark 3.5.8 (audited 2026-05-27): baseline. `(StringType|BinaryType) ->
IntegerType`; eval returns `numBytes * 8` for strings and `.length * 8` for
binary.
+ - Spark 4.0.1 (audited 2026-05-27): `inputTypes` widened to
`StringTypeWithCollation(supportsTrimCollation = true)`; semantics unchanged.
+ - Known limitation: wired as a raw `CometScalarFunction("bit_length")` with
no `BinaryType` guard. DataFusion's `BitLengthFunc` signature only accepts
string types, so `bit_length(<binary>)` fails at execution time on the native side
instead of falling back cleanly
(https://github.com/apache/datafusion-comet/issues/4464).
- [x] btrim
+ - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+ - Spark 3.5.8 (audited 2026-05-27): baseline. `StringTrimBoth` is
`RuntimeReplaceable` and rewritten to `StringTrim(srcStr, trimStr)` before
serde runs, so the explicit `CometScalarFunction("btrim")` mapping is
unreachable.
+ - Spark 4.0.1 (audited 2026-05-27): `StringTrim` (the rewrite target) routes
through `CollationSupport.StringTrim.exec` and uses
`StringTypeNonCSAICollation(supportsTrimCollation = true)`; semantics unchanged
for `UTF8_BINARY`. Non-default collations may diverge in Comet.
- [x] char
+ - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+ - Spark 3.5.8 (audited 2026-05-27): baseline. `Chr(LongType) -> StringType`;
`lon < 0` returns `""`, else `((lon & 0xFF) as char).toString` (so `chr(256)`
and `chr(0)` both return `\u0000`).
+ - Spark 4.0.1 (audited 2026-05-27): semantics unchanged; `NullIntolerant`
trait replaced by `override def nullIntolerant: Boolean = true`. Resolves
natively to `datafusion_spark::function::string::char::CharFunc`, which mirrors
Spark's negative-input and `& 0xFF` semantics.
- [x] char_length
+ - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+ - Spark 3.5.8 (audited 2026-05-27): registry alias of `Length`. Same support
as `length`.
+ - Spark 4.0.1 (audited 2026-05-27): unchanged alias of `Length`.
- [x] character_length
+ - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+ - Spark 3.5.8 (audited 2026-05-27): registry alias of `Length`. Same support
as `length`.
+ - Spark 4.0.1 (audited 2026-05-27): unchanged alias of `Length`.
- [x] chr
+ - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+ - Spark 3.5.8 (audited 2026-05-27): registry alias of `Chr`. Same support as
`char`.
+ - Spark 4.0.1 (audited 2026-05-27): unchanged alias of `Chr`.
- [ ] collate
- [ ] collation
- [x] concat_ws
+ - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+ - Spark 3.5.8 (audited 2026-05-27): baseline. `Seq[Expression] ->
StringType`; NULL separator yields NULL, NULL element values are skipped,
children can be `StringType` or `ArrayType(StringType)`. Comet serde rewrites a
NULL-literal separator to a NULL of the result type and bails out on
all-foldable inputs so Spark's `ConstantFolding` handles them; otherwise
delegates to DataFusion `concat_ws`.
+ - Spark 4.0.1 (audited 2026-05-27): `inputTypes` widened to
`StringTypeWithCollation` / `AbstractArrayType`; `dataType` becomes
`children.head.dataType` (collation-derived). Semantics unchanged for
`UTF8_BINARY`.
- [x] contains
+ - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+ - Spark 3.5.8 (audited 2026-05-27): baseline. `UTF8String.contains` on
`StringType`; the parser routes `(BinaryType, BinaryType)` to
`BinaryPredicate`, so Comet only ever sees the String form.
+ - Spark 4.0.1 (audited 2026-05-27): routes through
`CollationSupport.Contains.exec(..., collationId)`; behaviour identical for
`UTF8_BINARY`. Non-default collations are not honoured by Comet.
- [x] decode
+ - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+ - Spark 3.5.8 (audited 2026-05-27): baseline. `StringDecode(bin, charset)`
evaluated directly; invalid sequences silently substitute replacement
characters via `new String(bytes, charset)`.
+ - Spark 4.0.1 (audited 2026-05-27): refactored to `RuntimeReplaceable` whose
`replacement` is a `StaticInvoke(StringDecode.decode, bin, charset,
legacyCharsets, legacyErrorAction)`; the 4-arg form raises on malformed input
unless legacy flags are set.
+ - Known limitations: Comet handles `decode` via
`CommonStringExprs.stringDecode` from the version shims (no
`CometExpressionSerde[StringDecode]` registration, so the function does not
surface in the auto-generated compatibility docs:
https://github.com/apache/datafusion-comet/issues/4466). Only literal `charset
= 'utf-8'` (case-insensitive) is supported; everything else falls back. The
Spark 4.0 `legacyCharsets` / `legacyErrorAction` flags are ignored: Comet
always lowers to `Cast(bin, StringType, TRY)`, so invalid UTF-8 yields NULL
where Spark 3.x substitutes replacement characters and Spark 4.0 (non-legacy)
raises (https://github.com/apache/datafusion-comet/issues/4465).
- [ ] elt
- [ ] encode
- [x] endswith
+ - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+ - Spark 3.5.8 (audited 2026-05-27): baseline. `UTF8String.endsWith` on
`StringType`; binary form routed to `BinaryPredicate` before Comet.
+ - Spark 4.0.1 (audited 2026-05-27): routes through
`CollationSupport.EndsWith.exec`; semantics unchanged for `UTF8_BINARY`.
- [ ] find_in_set
- [ ] format_number
- [ ] format_string
- [x] initcap
+ - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+ - Spark 3.5.8 (audited 2026-05-27): baseline.
`string.toLowerCase.toTitleCase` on `UTF8String`; word boundary is Java
`Character.isWhitespace`. Comet routes to DataFusion `initcap`, which splits on
`!is_alphanumeric()` (hyphens, apostrophes, and punctuation all split words),
so Comet is unconditionally `Incompatible`
(https://github.com/apache/datafusion-comet/issues/1052).
+ - Spark 4.0.1 (audited 2026-05-27): routes through
`CollationSupport.InitCap.exec` (collation- and ICU-aware) and propagates
`child.dataType`. Comet ignores collation, so the 3.x divergences persist and
collation/ICU mismatches are added on top.
- [x] instr
+ - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+ - Spark 3.5.8 (audited 2026-05-27): baseline. `StringInstr(str, substr) ->
IntegerType`; returns `string.indexOf(sub, 0) + 1` (1-based, 0 when not found,
1 on empty substring). Resolves to DataFusion `strpos` (alias `instr`) with
matching semantics.
+ - Spark 4.0.1 (audited 2026-05-27): routes through
`CollationSupport.StringInstr.exec`; semantics unchanged for `UTF8_BINARY`.
- [ ] is_valid_utf8
- [x] lcase
+ - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+ - Spark 3.5.8 (audited 2026-05-27): registry alias of `Lower`. Same support
as `lower`.
+ - Spark 4.0.1 (audited 2026-05-27): unchanged alias of `Lower`.
- [x] left
+ - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+ - Spark 3.5.8 (audited 2026-05-27): baseline. `RuntimeReplaceable` with
`replacement = Substring(str, Literal(1), len)`; accepts `StringType` or
`BinaryType` plus `IntegerType`. Comet serde rewrites to a `Substring` proto
with `start=1, len=lenValue`, requires `len` to be a `Literal`, and falls back
otherwise.
+ - Spark 4.0.1 (audited 2026-05-27): `inputTypes` widened with
`StringTypeWithCollation`; behaviour unchanged for `UTF8_BINARY`.
+ - Known limitation: the literal-only `len` restriction is enforced inside
`convert` via `withInfo` rather than declared in `getSupportLevel`, so EXPLAIN
surfaces it only at conversion time.
Review Comment:
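We should fix this limitation (the literal-only `len` check for `left` being
enforced in `convert` instead of declared in `getSupportLevel`) rather than
only documenting it.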
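
Assuming the comment refers to the `left` limitation quoted last in the hunk, one possible shape for the fix is sketched below. This is a toy Python model, not Comet's Scala serde API: `Left`, `Literal`, `ColumnRef`, `SupportLevel`, `get_support_level`, and `convert` are all hypothetical stand-ins, and the returned dict only imitates a `Substring` proto. The idea it tries to show is that the literal-only `len` restriction is declared once, up front, and `convert` relies on that declaration instead of discovering the problem on its own.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Literal:
    """Hypothetical stand-in for a Spark literal expression."""
    value: int


@dataclass(frozen=True)
class ColumnRef:
    """Hypothetical stand-in for a non-literal expression (a column reference)."""
    name: str


@dataclass(frozen=True)
class Left:
    """Hypothetical stand-in for Spark's Left(str, len)."""
    source: ColumnRef
    length: Union[Literal, ColumnRef]


@dataclass(frozen=True)
class SupportLevel:
    """Hypothetical support declaration: a flag plus an optional reason."""
    supported: bool
    note: Optional[str] = None


def get_support_level(expr: Left) -> SupportLevel:
    """Declare the literal-only restriction before any conversion is attempted."""
    if not isinstance(expr.length, Literal):
        return SupportLevel(False, "left: len must be a literal")
    return SupportLevel(True)


def convert(expr: Left) -> Optional[dict]:
    """Build a Substring-like node, or return None to signal fallback to Spark."""
    if not get_support_level(expr).supported:
        return None
    # Mirrors the rewrite described in the notes: start=1, len=<literal value>.
    return {
        "op": "substring",
        "child": expr.source.name,
        "start": 1,
        "len": expr.length.value,
    }


# Usage: a literal length converts; a column-valued length is rejected up front.
literal_len = Left(ColumnRef("name"), Literal(3))
column_len = Left(ColumnRef("name"), ColumnRef("n"))
print(get_support_level(column_len))
print(convert(literal_len))
```

With the restriction declared in the support-level check, tooling that inspects support before conversion (such as EXPLAIN output) could report the reason without running `convert` at all. Whether the real change should return an unsupported or an incompatible level is left to the PR author.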
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]