andygrove opened a new issue, #4506: URL: https://github.com/apache/datafusion-comet/issues/4506
## Background A Comet expression marked `Incompatible(Some(reason))` in `getSupportLevel` is the native Rust implementation telling the dispatcher "I diverge from Spark in this case." Today the dispatcher behaviour is binary: - `spark.comet.expr.allowIncompatible=false` (default): Comet falls back to Spark for the whole projection, losing the native pipeline for every other expression in that projection. - `spark.comet.expr.allowIncompatible=true`: Comet runs the divergent native impl and the user opts in to the wrong-answer risk. Neither default is great. The fallback path silently regresses performance for users who never opted in to the trade-off, and `allowIncompatible=true` changes correctness for every Incompatible expression at once. ## Proposal For each `Incompatible` branch, provide a JVM-side implementation routed through the existing codegen-dispatch path (the same mechanism that wires `CometScalaUDF` to Janino-codegen Scala functions). The dispatcher then has a third option that is always correct: stay in Comet, evaluate the divergent expression in the JVM via codegen, and keep the rest of the projection native. This lets us flip the default for Incompatible expressions to "Comet handles it correctly without falling back," and leaves `allowIncompatible=true` purely as a perf knob for users who accept the native-path divergence. ## Scope by category Each item below is an `Incompatible` branch with a file:line anchor in `spark/src/main/scala/org/apache/comet/serde/`. Conditional Incompatibles (e.g. only for `TimestampNTZ` or only with `COMET_EXEC_STRICT_FLOATING_POINT=true`) only need the JVM dispatch for the divergent branch; the rest stays on the native path. ### String - `CometInitCap` (`initcap`) - `strings.scala:98` - unconditional Incompatible (Unicode case mapping divergence). - `CometConcat` (`concat`) - `strings.scala:236` - Incompatible for `ArrayType` inputs. - `CometRegExpReplace` (`regexp_replace`) - `strings.scala:384` - Java/Rust regex engine differences. - `CometStringSplit` (`split`) - `strings.scala:424` - Java/Rust regex engine differences. - `CometGetJsonObject` (`get_json_object`) - `strings.scala:451` - single-quoted JSON / control-character handling diverges from Spark. ### Date/time - `CometHour` (`hour`) - `datetime.scala:191` - Incompatible for `TimestampNTZType` (#3180). - `CometMinute` (`minute`) - `datetime.scala:233` - same (#3180). - `CometSecond` (`second`) - `datetime.scala:275` - same (#3180). - `CometFromUTCTimestamp` (`from_utc_timestamp`) - `datetime.scala:380` - timezone parser divergence. - `CometToUTCTimestamp` (`to_utc_timestamp`) - `datetime.scala:398` - same. - `CometConvertTimezone` (`convert_timezone`) - `datetime.scala:416` - same. - `CometTruncDate` (`trunc`) - `datetime.scala:500` - invalid format strings throw instead of returning NULL. - `CometTruncTimestamp` (`date_trunc`) - `datetime.scala:555,564` - non-UTC timezone (#2649) and invalid format strings. - `CometFromUnixTime` (`from_unixtime`) - `unixtime.scala:37` - unconditional Incompatible. ### Array - `CometSortArray` (`sort_array`) - `arrays.scala:157` - Incompatible for floating-point with `spark.comet.exec.strict.floatingPoint=true`. - `CometArrayIntersect` (`array_intersect`) - `arrays.scala:211` - documented divergence (ordering / duplicate handling). - `CometArrayExcept` (`array_except`) - `arrays.scala:340` - unconditional Incompatible. - `CometArrayJoin` (`array_join`) - `arrays.scala:381` - unconditional Incompatible. - `CometArrayReverse` (`reverse`) - `arrays.scala:539` - Incompatible when element type contains `BinaryType`. ### Map - `CometMapFromEntries` (`map_from_entries`) - `maps.scala:154,157` - per-key-type and per-value-type incompat reasons. ### Struct / JSON - `CometJsonToStructs` (`from_json`) - `structs.scala:173` - partially implemented and not comprehensively tested (#3232). - `CometStructsToCsv` (`to_csv`) - `structs.scala:254,260` - incompatible data types in schema. ### URL - `CometParseUrl` (`parse_url`) - `url.scala:36` - URL parser divergence. ### Aggregate - `CometCollectSet` (`collect_set`) - `aggregates.scala:722` - floating-point with `spark.comet.exec.strict.floatingPoint=true` (Comet dedupes NaN, Spark keeps each NaN distinct). ## Dispatcher work In addition to the per-expression JVM impls, the dispatcher in `QueryPlanSerde` needs a new branch for `Incompatible` that prefers the codegen-dispatch path when a JVM impl is registered, and only falls back to Spark (or runs the native path under `allowIncompatible=true`) when no JVM impl is wired. The existing `CometScalaUDF` -> codegen-dispatch plumbing is the model. ## Out of scope - `Unsupported` branches (no native impl exists at all - falling back is the only option). - Expressions where the native path is `Compatible` but the support level was set to `Incompatible` only because of a Spark 4.0+ collation gap covered by the umbrella collation work (#2190, #4496). Those should be addressed by the collation effort rather than per-expression JVM dispatch. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
