[PR] feat: route structured-text functions through codegen dispatcher [datafusion-comet]

via GitHub Wed, 10 Jun 2026 08:28:53 -0700


andygrove opened a new pull request, #4620:
URL: https://github.com/apache/datafusion-comet/pull/4620


   ## Which issue does this PR close?
   
   Closes #4619.
   
   ## Rationale for this change
   
   The CSV / JSON / XPath / XML structured-text functions had no Comet 
implementation, so any query using them fell back to Spark for the enclosing 
operator. They are hard to implement natively in Rust (parsing with 
Spark-specific semantics), but they all extend Spark's `CodegenFallback`, which 
the codegen dispatcher already admits (the same mechanism that backs 
`from_json`/`to_json`). Routing them through the dispatcher keeps a top-level 
projection native while matching Spark exactly.
   
   ## What changes are included in this PR?
   
   Wire the following previously unsupported functions through the codegen 
dispatcher (no native Rust path):
   
   - `from_csv`, `schema_of_csv`
   - `schema_of_json`, `json_object_keys`
   - `xpath`, `xpath_boolean`, `xpath_short`, `xpath_int`, `xpath_long`, `xpath_float`, `xpath_double`, `xpath_string`
   - `from_xml`, `to_xml`, `schema_of_xml` (Spark 4.0+ only)
   
   Wiring differs by Spark version:
   
   - On Spark 3.4/3.5 these are plain `CodegenFallback` expressions, registered 
directly in the serde maps (`CometCsvToStructs`, `CometSchemaOfJson`, 
`CometXPathInt`, etc.).
   - On Spark 4.x they are `RuntimeReplaceable`: the optimizer rewrites them to 
`Invoke(evaluator)` / `StaticInvoke` before Comet sees the physical plan, so 
they are dispatched from `CometExprShim4x.convertStructuredText`. That method 
matches the backing evaluators by simple class name to stay robust across Spark 4.0/4.1/4.2, where 
the classes have moved between packages. `from_xml`/`to_xml` stay as plain 
expressions and are matched by type. The match is scoped to the structured-text 
evaluators so other `Invoke`/`StaticInvoke` nodes (e.g. `to_json`, `parse_url`) 
keep their existing handling.
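
As a rough illustration of the Spark 4.x matching described above, the sketch below shows how matching on a simple class name can survive a class moving between packages while leaving unrelated nodes alone. It is not the code from this PR: `InvokeLike`, `structuredTextEvaluators`, `simpleName`, `matchStructuredText`, the evaluator names, and the package names are all hypothetical stand-ins, and the real shim works on Spark's `Invoke` / `StaticInvoke` nodes rather than on plain strings.

```scala
object SimpleNameMatchSketch {
  // Hypothetical stand-in for an Invoke/StaticInvoke node: only the fully
  // qualified class name of the backing evaluator is modelled here.
  final case class InvokeLike(evaluatorClassName: String)

  // Illustrative simple names only (an assumption); the real set of
  // structured-text evaluators is defined by the PR, not by this sketch.
  val structuredTextEvaluators: Set[String] =
    Set("SchemaOfCsvEvaluator", "XPathIntEvaluator")

  // Strip the package prefix so a class that moved packages still matches.
  def simpleName(fqcn: String): String =
    fqcn.substring(fqcn.lastIndexOf('.') + 1)

  // Some(simple name) for a structured-text evaluator; None means the node
  // is out of scope and keeps whatever handling it already had.
  def matchStructuredText(node: InvokeLike): Option[String] =
    Some(simpleName(node.evaluatorClassName)).filter(structuredTextEvaluators.contains)
}
```

With these assumptions, `matchStructuredText(InvokeLike("pkg.old.SchemaOfCsvEvaluator"))` and `matchStructuredText(InvokeLike("pkg.moved.SchemaOfCsvEvaluator"))` both resolve to the same simple name, while an evaluator outside the set returns `None`, which corresponds to the scoping that lets `to_json` and `parse_url` keep their existing handling.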
   
   When `spark.comet.exec.scalaUDF.codegen.enabled=false`, these fall back to 
Spark.
   
   ## How are these changes tested?
   
   Adds `CometStructuredTextSuite`, which asserts each function stays native 
and matches Spark (`checkSparkAnswerAndOperator`) over parquet-backed string 
columns with null and empty rows, plus fall-back-when-disabled coverage for a 
CSV, an XPath, and an XML function. The XML tests are gated to Spark 4.0+. 
Verified locally on the spark-3.4, 3.5, 4.0, 4.1, and 4.2 profiles 
(CSV/JSON/XPath on all five; XML on 4.0+).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

