andygrove opened a new pull request, #4620: URL: https://github.com/apache/datafusion-comet/pull/4620
## Which issue does this PR close?

Closes #4619.

## Rationale for this change

The CSV / JSON / XPath / XML structured-text functions had no Comet implementation, so any query using them fell back to Spark for the enclosing operator. They are hard to implement natively in Rust (parsing with Spark-specific semantics), but they all extend Spark's `CodegenFallback`, which the codegen dispatcher already admits (the same mechanism that backs `from_json`/`to_json`). Routing them through the dispatcher keeps a top-level projection native while matching Spark exactly.

## What changes are included in this PR?

Wire the following previously-unsupported functions through the codegen dispatcher (no native Rust path):

- `from_csv`, `schema_of_csv`
- `schema_of_json`, `json_object_keys`
- `xpath`, `xpath_boolean/short/int/long/float/double/string`
- `from_xml`, `to_xml`, `schema_of_xml` (Spark 4.0+ only)

Wiring differs by Spark version:

- On Spark 3.4/3.5 these are plain `CodegenFallback` expressions, registered directly in the serde maps (`CometCsvToStructs`, `CometSchemaOfJson`, `CometXPathInt`, etc.).
- On Spark 4.x they are `RuntimeReplaceable`: the optimizer rewrites them to `Invoke(evaluator)` / `StaticInvoke` before Comet sees the physical plan, so they are dispatched from `CometExprShim4x.convertStructuredText`. It matches the backing evaluators by simple name to stay robust across 4.0/4.1/4.2, where the classes have moved between packages. `from_xml`/`to_xml` stay as plain expressions and are matched by type. The match is scoped to the structured-text evaluators so other `Invoke`/`StaticInvoke` nodes (e.g. `to_json`, `parse_url`) keep their existing handling.

When `spark.comet.exec.scalaUDF.codegen.enabled=false`, these fall back to Spark.

## How are these changes tested?
Adds `CometStructuredTextSuite`, which asserts each function stays native and matches Spark (`checkSparkAnswerAndOperator`) over parquet-backed string columns with null and empty rows, plus fall-back-when-disabled coverage for a CSV, an XPath, and an XML function. The XML tests are gated to Spark 4.0+. Verified locally on the spark-3.4, 3.5, 4.0, 4.1, and 4.2 profiles (CSV/JSON/XPath on all five; XML on 4.0+).
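
For readers unfamiliar with the "match by simple name" approach described for Spark 4.x, here is a minimal, standalone sketch of the idea. It is not the code in `CometExprShim4x.convertStructuredText`. The object name, the evaluator names, and the `org.example` packages are all hypothetical and chosen only for illustration. The point it shows is that comparing only the trailing class name keeps a match stable when a class moves between packages, while a scoped allow-list leaves unrelated classes untouched.

```scala
object SimpleNameDispatchSketch {
  // Hypothetical evaluator names, for illustration only; the real set is
  // defined by the PR's shim and is not reproduced here.
  private val structuredTextEvaluators: Set[String] =
    Set("ExampleCsvEvaluator", "ExampleXPathEvaluator")

  // Strip the package prefix from a fully qualified class name.
  // A name without any '.' is returned unchanged.
  def simpleName(fullyQualifiedName: String): String =
    fullyQualifiedName.substring(fullyQualifiedName.lastIndexOf('.') + 1)

  // True when the class belongs to the scoped allow-list, regardless of
  // which package it currently lives in.
  def isStructuredText(fullyQualifiedName: String): Boolean =
    structuredTextEvaluators.contains(simpleName(fullyQualifiedName))

  def main(args: Array[String]): Unit = {
    val candidates = Seq(
      "org.example.v40.ExampleXPathEvaluator",     // hypothetical 4.0 location
      "org.example.v41.xml.ExampleXPathEvaluator", // hypothetical 4.1 location
      "org.example.v40.ExampleUrlEvaluator"        // not in the allow-list
    )
    candidates.foreach { name =>
      println(s"$name -> ${isStructuredText(name)}")
    }
  }
}
```

The trade-off of this style is that two unrelated classes sharing a simple name would both match, which is why the sketch (like the PR text) keeps the allow-list narrow.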
