andygrove commented on code in PR #4461:
URL: https://github.com/apache/datafusion-comet/pull/4461#discussion_r3319454646
##########
docs/source/contributor-guide/spark_expressions_support.md:
##########
@@ -523,40 +523,109 @@
### string_funcs
- [x] ascii
+ - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+ - Spark 3.5.8 (audited 2026-05-27): baseline. `StringType -> IntegerType`;
`nullSafeEval` returns `codePointAt(0)` of the first char, or `0` for the empty
string. Wired via `CometScalarFunction("ascii")` and resolved to DataFusion
`ascii` (`chars().next() as i32`); first-code-point semantics match for ASCII,
BMP, and supplementary code points.
+ - Spark 4.0.1 (audited 2026-05-27): `inputTypes` widened to
`StringTypeWithCollation(supportsTrimCollation = true)`; behaviour unchanged
for `UTF8_BINARY`. Comet does not propagate collation, so non-default
collations may diverge silently.
- [ ] base64
- [x] bit_length
+ - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+ - Spark 3.5.8 (audited 2026-05-27): baseline. `(StringType|BinaryType) ->
IntegerType`; eval returns `numBytes * 8` for strings and `.length * 8` for
binary.
+ - Spark 4.0.1 (audited 2026-05-27): `inputTypes` widened to
`StringTypeWithCollation(supportsTrimCollation = true)`; semantics unchanged.
+ - Known limitation: wired as a raw `CometScalarFunction("bit_length")` with
no `BinaryType` guard. DataFusion's `BitLengthFunc` signature only accepts
string types, so `bit_length(<binary>)` execute-fails on the native side
instead of falling back cleanly
(https://github.com/apache/datafusion-comet/issues/4464).
- [x] btrim
+ - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+ - Spark 3.5.8 (audited 2026-05-27): baseline. `StringTrimBoth` is
`RuntimeReplaceable` and rewritten to `StringTrim(srcStr, trimStr)` before
serde runs, so the explicit `CometScalarFunction("btrim")` mapping is
unreachable.
+ - Spark 4.0.1 (audited 2026-05-27): `StringTrim` (the rewrite target) routes
through `CollationSupport.StringTrim.exec` and uses
`StringTypeNonCSAICollation(supportsTrimCollation = true)`; semantics unchanged
for `UTF8_BINARY`. Non-default collations may diverge in Comet.
Review Comment:
I filed https://github.com/apache/datafusion-comet/issues/4496 to track a
follow-on audit focusing on collation issues once this PR is merged.
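
For reference, the baseline (`UTF8_BINARY`) semantics that the audit notes above describe for `ascii` and `bit_length` can be sketched in plain Rust. This is only an illustration under stated assumptions, not Comet's or DataFusion's actual code: the function names `ascii_first_code_point`, `bit_length_str`, and `bit_length_bytes` are hypothetical, and collation is ignored entirely, which is exactly the gap the follow-on audit is meant to cover.

```rust
/// Hypothetical helper: code point of the first character, or 0 for the
/// empty string, mirroring the `chars().next() as i32` behaviour quoted
/// in the audit note for `ascii`.
fn ascii_first_code_point(s: &str) -> i32 {
    s.chars().next().map_or(0, |c| c as i32)
}

/// Hypothetical helper: number of UTF-8 bytes times 8, matching the
/// `numBytes * 8` rule the note gives for string input.
fn bit_length_str(s: &str) -> i32 {
    (s.len() * 8) as i32
}

/// Hypothetical helper: byte-array length times 8, matching the
/// `.length * 8` rule the note gives for binary input.
fn bit_length_bytes(b: &[u8]) -> i32 {
    (b.len() * 8) as i32
}
```

The first helper returns the full Unicode scalar value rather than the leading byte, which is why the note can claim matching results for ASCII, BMP, and supplementary code points alike.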
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]