andygrove opened a new issue, #4500: URL: https://github.com/apache/datafusion-comet/issues/4500
Tracking issue for follow-up work surfaced by the math expression audit in #4486. Each item below is either a support-level / serde correctness fix that the audit deliberately deferred, or a coverage gap the audit documented but did not implement. ## High priority ### 1. Lift `convert`-time `withInfo` fallbacks in `arithmetic.scala` into `getSupportLevel` `CometAdd`, `CometSubtract`, `CometMultiply`, `CometDivide`, `CometIntegralDivide`, `CometRemainder`, `CometUnaryMinus`, and `CometRound` (`spark/src/main/scala/org/apache/comet/serde/arithmetic.scala:94-271, 300-321, 355`) currently bail out from `convert` with `withInfo(...) + return None` for cases that are knowable from the expression alone (unsupported datatype, `EvalMode.TRY` on `Remainder`, negative-scale `Decimal` on `Round`, `Float` / `Double` on `Round`, etc.). The `audit-comet-expression` skill requires expression-shape restrictions to be declared as `Unsupported(Some(reason))` / `Incompatible(Some(reason))` branches in `getSupportLevel` so that: - EXPLAIN surfaces the reason at planning time - the auto-generated compatibility doc picks them up - the dispatcher can route around them This was deferred from #4486 because `arithmetic.scala` is load-bearing and the fix is mechanical but wide. ### 2. `UnaryPositive` has no serde on Spark 3.4 / 3.5 Projections containing `+col` silently disable Comet for the enclosing projection on Spark 3.4 / 3.5 because there is no `CometUnaryPositive` registration. On Spark 4.0+ `UnaryPositive` is `RuntimeReplaceable` and the optimizer removes it before serde, so the gap is transparent there. Wire a passthrough serde for the 3.4 / 3.5 shim path so the projection stays native. ### 3. `hex` / `unhex` Spark 4.x collation propagation Spark 4.x widens `Hex` / `Unhex` input types to `StringTypeWithCollation` and preserves collation in `dataType`. `CometHex` / `CometUnhex` pass `expr.dataType` to the native `SparkHex` / `spark_unhex` UDFs, which always return `Utf8`, so collation does not propagate. Per the audit skill, the support level should be `Incompatible(Some(reason))` for non-default collations rather than the current `Compatible`. Cross-reference #2190 / #4496. ## Medium priority (robustness / coverage) ### 4. `greatest` / `least` do not gate input types Registered as bare `CometScalarFunction(\"greatest\")` / `CometScalarFunction(\"least\")` in `QueryPlanSerde.scala`. Spark accepts interval inputs (and a few Spark-specific orderings) that the native UDF may not handle; there is no explicit fallback path, so a runtime error from the native UDF is the failure mode. Add an input-type guard that falls back to Spark for the unsupported cases. ### 5. `ceil(expr, scale)` / `floor(expr, scale)` (RoundCeil / RoundFloor) not wired The two-argument forms always fall back to Spark. Wire them via the same path as `Round`. ### 6. `BRound` (HALF_EVEN rounding) not wired Listed as `[ ] bround` in the support doc. Banker's rounding is independent of the BigDecimal-via-toString issue that blocks Float / Double `Round`, so the integer / decimal paths could be supported. ### 7. `signum` rejects interval inputs Spark `Signum` accepts `DayTimeIntervalType` / `YearMonthIntervalType` via `inputTypes`; Comet only handles `DoubleType`. Either add an explicit fallback or extend the native path. --- Surfaced by the `audit-comet-expression` skill run in #4486. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
