andygrove opened a new issue, #4506:
URL: https://github.com/apache/datafusion-comet/issues/4506

   ## Background
   
   A Comet expression marked `Incompatible(Some(reason))` in `getSupportLevel` 
is the native Rust implementation telling the dispatcher "I diverge from Spark 
in this case." Today the dispatcher behaviour is binary:
   
   - `spark.comet.expr.allowIncompatible=false` (default): Comet falls back to 
Spark for the whole projection, losing the native pipeline for every other 
expression in that projection.
   - `spark.comet.expr.allowIncompatible=true`: Comet runs the divergent native 
impl and the user opts in to the wrong-answer risk.
   
   Neither default is great. The fallback path silently regresses performance 
for users who never opted in to the trade-off, and `allowIncompatible=true` 
changes correctness for every Incompatible expression at once.
   
   ## Proposal
   
   For each `Incompatible` branch, provide a JVM-side implementation routed 
through the existing codegen-dispatch path (the same mechanism that wires 
`CometScalaUDF` to Janino-codegen Scala functions). The dispatcher then has a 
third option that is always correct: stay in Comet, evaluate the divergent 
expression in the JVM via codegen, and keep the rest of the projection native.
   
   This lets us flip the default for Incompatible expressions to "Comet handles 
it correctly without falling back," and leaves `allowIncompatible=true` purely 
as a perf knob for users who accept the native-path divergence.
   
   ## Scope by category
   
   Each item below is an `Incompatible` branch with a file:line anchor in 
`spark/src/main/scala/org/apache/comet/serde/`. Conditional Incompatibles (e.g. 
only for `TimestampNTZ` or only with `COMET_EXEC_STRICT_FLOATING_POINT=true`) 
only need the JVM dispatch for the divergent branch; the rest stays on the 
native path.
   
   ### String
   
   - `CometInitCap` (`initcap`) - `strings.scala:98` - unconditional 
Incompatible (Unicode case mapping divergence).
   - `CometConcat` (`concat`) - `strings.scala:236` - Incompatible for 
`ArrayType` inputs.
   - `CometRegExpReplace` (`regexp_replace`) - `strings.scala:384` - Java/Rust 
regex engine differences.
   - `CometStringSplit` (`split`) - `strings.scala:424` - Java/Rust regex 
engine differences.
   - `CometGetJsonObject` (`get_json_object`) - `strings.scala:451` - 
single-quoted JSON / control-character handling diverges from Spark.
   
   ### Date/time
   
   - `CometHour` (`hour`) - `datetime.scala:191` - Incompatible for 
`TimestampNTZType` (#3180).
   - `CometMinute` (`minute`) - `datetime.scala:233` - same (#3180).
   - `CometSecond` (`second`) - `datetime.scala:275` - same (#3180).
   - `CometFromUTCTimestamp` (`from_utc_timestamp`) - `datetime.scala:380` - 
timezone parser divergence.
   - `CometToUTCTimestamp` (`to_utc_timestamp`) - `datetime.scala:398` - same.
   - `CometConvertTimezone` (`convert_timezone`) - `datetime.scala:416` - same.
   - `CometTruncDate` (`trunc`) - `datetime.scala:500` - invalid format strings 
throw instead of returning NULL.
   - `CometTruncTimestamp` (`date_trunc`) - `datetime.scala:555,564` - non-UTC 
timezone (#2649) and invalid format strings.
   - `CometFromUnixTime` (`from_unixtime`) - `unixtime.scala:37` - 
unconditional Incompatible.
   
   ### Array
   
   - `CometSortArray` (`sort_array`) - `arrays.scala:157` - Incompatible for 
floating-point with `spark.comet.exec.strict.floatingPoint=true`.
   - `CometArrayIntersect` (`array_intersect`) - `arrays.scala:211` - 
documented divergence (ordering / duplicate handling).
   - `CometArrayExcept` (`array_except`) - `arrays.scala:340` - unconditional 
Incompatible.
   - `CometArrayJoin` (`array_join`) - `arrays.scala:381` - unconditional 
Incompatible.
   - `CometArrayReverse` (`reverse`) - `arrays.scala:539` - Incompatible when 
element type contains `BinaryType`.
   
   ### Map
   
   - `CometMapFromEntries` (`map_from_entries`) - `maps.scala:154,157` - 
per-key-type and per-value-type incompat reasons.
   
   ### Struct / JSON
   
   - `CometJsonToStructs` (`from_json`) - `structs.scala:173` - partially 
implemented and not comprehensively tested (#3232).
   - `CometStructsToCsv` (`to_csv`) - `structs.scala:254,260` - incompatible 
data types in schema.
   
   ### URL
   
   - `CometParseUrl` (`parse_url`) - `url.scala:36` - URL parser divergence.
   
   ### Aggregate
   
   - `CometCollectSet` (`collect_set`) - `aggregates.scala:722` - 
floating-point with `spark.comet.exec.strict.floatingPoint=true` (Comet dedupes 
NaN, Spark keeps each NaN distinct).
   
   ## Dispatcher work
   
   In addition to the per-expression JVM impls, the dispatcher in 
`QueryPlanSerde` needs a new branch for `Incompatible` that prefers the 
codegen-dispatch path when a JVM impl is registered, and only falls back to 
Spark (or runs the native path under `allowIncompatible=true`) when no JVM impl 
is wired. The existing `CometScalaUDF` -> codegen-dispatch plumbing is the 
model.
   
   ## Out of scope
   
   - `Unsupported` branches (no native impl exists at all - falling back is the 
only option).
   - Expressions where the native path is `Compatible` but the support level 
was set to `Incompatible` only because of a Spark 4.0+ collation gap covered by 
the umbrella collation work (#2190, #4496). Those should be addressed by the 
collation effort rather than per-expression JVM dispatch.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to