andygrove opened a new pull request, #3747:
URL: https://github.com/apache/datafusion-comet/pull/3747

   ## Which issue does this PR close?
   
   Closes #3162.
   
   ## Rationale for this change
   
   `get_json_object` is a widely-used Spark function for extracting values from 
JSON strings using JSONPath expressions. Without native support, queries using 
this function fall back to Spark's JVM execution. This PR adds an initial 
native implementation to allow Comet to accelerate these queries.
   
   This is a starting point. The expression is marked `Incompatible` and is 
**disabled by default**. Users must set 
`spark.comet.expression.GetJsonObject.allowIncompatible=true` to enable it.
   
   ## What changes are included in this PR?
   
   **Rust implementation** 
(`native/spark-expr/src/string_funcs/get_json_object.rs`):
   - Custom JSONPath parser supporting `$` (root), `.field`, `['field']` 
(bracket notation), `[n]` (array index), and `[*]` (array wildcard)
   - Path evaluation with separate fast-path for non-wildcard paths (zero Vec 
allocations) and wildcard paths
   - Uses `serde_json` with `preserve_order` feature for Spark-compatible key 
ordering
   - 19 unit tests
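
   For illustration, the segment grammar above can be sketched as a tiny hand-rolled parser along these lines. This is a simplified, hypothetical sketch, not the PR's actual code: `PathSegment` and `parse_path` are invented names, and the real parser also handles details like escaping and error reporting.

   ```rust
   // Hypothetical sketch of a parser for the JSONPath subset listed above:
   // $ (root), .field, ['field'], [n], [*]. Names are illustrative only.
   #[derive(Debug, PartialEq)]
   enum PathSegment {
       Key(String),  // .field or ['field']
       Index(usize), // [n]
       Wildcard,     // [*]
   }

   fn parse_path(path: &str) -> Option<Vec<PathSegment>> {
       // Every path must start at the root marker `$`.
       let mut rest = path.strip_prefix('$')?;
       let mut segments = Vec::new();
       while !rest.is_empty() {
           if let Some(r) = rest.strip_prefix('.') {
               // .field: consume up to the next '.' or '['
               let end = r.find(|c| c == '.' || c == '[').unwrap_or(r.len());
               if end == 0 {
                   return None; // empty field name, e.g. "$.."
               }
               segments.push(PathSegment::Key(r[..end].to_string()));
               rest = &r[end..];
           } else if let Some(r) = rest.strip_prefix('[') {
               let end = r.find(']')?;
               let inner = &r[..end];
               if inner == "*" {
                   segments.push(PathSegment::Wildcard);
               } else if let Some(q) =
                   inner.strip_prefix('\'').and_then(|s| s.strip_suffix('\''))
               {
                   segments.push(PathSegment::Key(q.to_string())); // ['field']
               } else {
                   segments.push(PathSegment::Index(inner.parse().ok()?)); // [n]
               }
               rest = &r[end + 1..];
           } else {
               return None; // unrecognized syntax
           }
       }
       Some(segments)
   }
   ```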
   
   **Scala serde** 
(`spark/src/main/scala/org/apache/comet/serde/strings.scala`):
   - `CometGetJsonObject` with `getSupportLevel` returning `Incompatible` 
(Spark's Jackson parser allows single-quoted JSON and unescaped control 
characters that `serde_json` does not)
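
   The strictness gap is easy to demonstrate: `serde_json` follows RFC 8259, so inputs that Spark's leniently configured Jackson parser tolerates fail to parse at all. A minimal illustration (the `is_valid_json` helper exists only for this example):

   ```rust
   // serde_json is a strict, spec-compliant JSON parser: single-quoted
   // strings and raw control characters inside string values are parse
   // errors, while Spark's Jackson parser accepts both.
   fn is_valid_json(s: &str) -> bool {
       serde_json::from_str::<serde_json::Value>(s).is_ok()
   }
   ```

   For example, `is_valid_json("{'a': 1}")` is false (single quotes), and a raw tab embedded in a string value such as `"{\"a\": \"x\ty\"}"` is likewise rejected, which is why the expression is marked `Incompatible`.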
   
   **Registration and wiring:**
   - Added to `stringExpressions` map in `QueryPlanSerde.scala`
   - Registered in `comet_scalar_funcs.rs` via 
`scalarFunctionExprToProtoWithReturnType`
   
   **SQL tests** (`get_json_object.sql`): 30 test queries covering field 
extraction, nested objects, arrays, wildcards, nulls, invalid JSON, bracket 
notation, and edge cases.
   
   **Docs**: Updated `expressions.md` and `spark_expressions_support.md`.
   
   ### Current performance
   
   Benchmarked with 1M rows of JSON (~200 bytes each) on Apple M3 Ultra:
   
   | Case | Spark (ms) | Comet (ms) | Relative |
   |------|-----------|------------|----------|
   | Simple field (`$.name`) | 705 | 785 | 0.9X |
   | Numeric field (`$.age`) | 725 | 789 | 0.9X |
   | Nested field (`$.address.city`) | 773 | 805 | 1.0X |
   | Array element (`$.items[0]`) | 734 | 795 | 0.9X |
   | Nested object (`$.address`) | 869 | 926 | 0.9X |
   
   Comet is currently ~10% slower than Spark. The primary reason is that 
`serde_json` parses the full JSON document into a DOM tree on every row, while 
Spark's Jackson-based implementation uses a streaming parser that can skip 
irrelevant fields without allocating.
   
   ### Known limitations and future work
   
   This is an initial implementation. Known gaps that could be addressed in 
follow-up PRs:
   
   1. **Streaming JSON parser**: Replace `serde_json::from_str` (full DOM 
parse) with a streaming approach (e.g., `jiter` or custom 
`serde_json::Deserializer` with `IgnoredAny`) to skip irrelevant JSON content 
without allocating. This would likely close the performance gap with Spark.
   2. **`$.*` on arrays**: Spark distinguishes `$.*` (object wildcard, using 
`Wildcard` token) from `$[*]` (array wildcard, using `Subscript::Wildcard`). 
Our parser treats both as the same `Wildcard` segment. Currently `$.*` on 
arrays returns values in Comet but null in Spark.
   3. **Double wildcard flattening**: Spark's `$[*][*]` triggers `FlattenStyle` 
which flattens nested arrays. Our implementation doesn't handle this special 
case.
   4. **Single wildcard match after index**: For patterns like 
`$.arr[0][*].field`, Spark's `WriteStyle` state machine may produce different 
wrapping behavior than our count-based approach.
   5. **`preserve_order` is workspace-wide**: Cargo unifies features, so 
enabling `preserve_order` on `serde_json` in `spark-expr` also enables it for 
all other crates in the workspace. Could be addressed by isolating the JSON 
parsing behind a feature flag.
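
   To make item 1 concrete, here is a deliberately toy sketch of the streaming idea: scan for a single top-level string field and return a borrowed slice, with no intermediate tree. It ignores nesting, escaping, and non-string values entirely, and a real implementation would instead drive `serde_json::Deserializer` or `jiter` from the parsed path segments; `find_top_level_string` is a hypothetical name, not part of this PR.

   ```rust
   // Toy sketch of skipping irrelevant JSON content without building a DOM.
   // Handles only unescaped top-level string values; shown purely to
   // illustrate why streaming avoids the per-row allocation cost of
   // serde_json::from_str::<Value>.
   fn find_top_level_string<'a>(json: &'a str, key: &str) -> Option<&'a str> {
       // Naive match on the quoted key (a real parser must tokenize instead,
       // since this could also match text inside a value).
       let needle = format!("\"{}\"", key);
       let start = json.find(&needle)? + needle.len();
       // Expect `: "value"` after the key; return the value as a borrowed slice.
       let rest = json[start..].trim_start().strip_prefix(':')?.trim_start();
       let value = rest.strip_prefix('"')?;
       let end = value.find('"')?;
       Some(&value[..end])
   }
   ```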
   
   ## How are these changes tested?
   
   - 19 Rust unit tests covering path parsing and evaluation edge cases
   - 30 SQL-file-based tests (`CometSqlFileTestSuite`) that run each query 
through both Spark and Comet and compare results, with dictionary encoding 
on/off
   - Microbenchmark (`CometGetJsonObjectBenchmark`) comparing Spark vs Comet 
performance across 5 query patterns


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
