sunchao opened a new pull request, #56685: URL: https://github.com/apache/spark/pull/56685
## Why are the changes needed?

[SPARK-47670](https://issues.apache.org/jira/browse/SPARK-47670) and #56547 added opt-in shared parsing for repeated `get_json_object` calls over the same input. The initial implementation intentionally shares only simple top-level fields, so repeated literal nested object paths still parse the same JSON independently.

For example, consider an `events` table whose `json` column contains:

```json
{"payload":{"user":{"id":123,"name":"alice"},"request_id":"r-1"}}
```

An existing query might run:

```sql
SELECT
  get_json_object(json, '$.payload.user.id') AS user_id,
  get_json_object(json, '$.payload.user.name') AS user_name,
  get_json_object(json, '$.payload.request_id') AS request_id
FROM events;
```

Before this PR, Spark parses each row's `json` value independently for all three nested extractions. With shared parsing enabled, Catalyst rewrites them to one internal `MultiGetJsonObject`, so the input is scanned once and all three values are returned from that scan. The SQL does not change.

This follow-up preserves the existing `get_json_object` behavior for malformed input, duplicate keys, nulls, non-object intermediate values, and rendering failures. It also preserves existing plans and execution behavior while `spark.sql.optimizer.getJsonObjectSharedParsing.enabled` is off, which remains the default.

## What changes were proposed in this PR?

This PR extends the existing internal `MultiGetJsonObject` expression and Catalyst rewrite:

- Parse literal named JSON paths into field-name segments and group repeated extractions over the same input. Dot notation such as `$.payload.user.id` and quoted names such as `$['payload']['user.name']` are supported.
- Build a runtime path trie so multiple nested fields can be extracted in one streaming JSON scan.
- Preserve the first non-null duplicate-key match independently for each requested path, including duplicate parent objects and non-object intermediate values.
- Never place an ancestor path and its descendant in the same shared parse.
- Select paths with insertion-ordered prefix tries, preserving first-path-wins behavior without quadratic pairwise comparisons.
- Build all greedy prefix-free groups in one optimizer invocation, avoiding repeated fixed-point passes for parallel prefix chains.
- Keep the existing top-level fast path, so top-level sharing does not pay for runtime trie traversal.
- Leave dynamic paths, wildcards, array subscripts, and paths deeper than 64 named fields on their existing independent `GetJsonObject` evaluation.
- Continue to use the existing default-off feature flag. With the flag off, the sharing rewrite is not invoked and existing plans remain unchanged.
- Extend the existing microbenchmark with repeated paths under a nested `payload` object.

There is no new user-facing expression and no query migration is required. Users opt in with the existing configuration:

```sql
SET spark.sql.optimizer.getJsonObjectSharedParsing.enabled = true;
```

### Path grouping examples

These sibling paths are prefix-free and share one parse:

```text
$.payload.user.id
$.payload.user.name
$.payload.request_id
```

Ancestor and descendant paths cannot share the same parse. For this ordered request:

```text
$.a
$.a.b
$.x
$.x.y
```

the optimizer builds both groups in one invocation:

```text
group 1: $.a, $.x
group 2: $.a.b, $.x.y
```

`$.a` never shares a parse with `$.a.b`, and `$.x` never shares one with `$.x.y`, so each path retains its independent legacy semantics.

Unsupported path forms remain unchanged. For example, `$.items[0].id`, `$.payload.*`, and a path supplied by another column continue to use the existing `GetJsonObject` evaluation.

## How was this PR tested?

- `build/sbt "catalyst/compile" "catalyst/Test/compile" "sql/Test/compile"`: passed on JDK 17.
- `build/sbt "catalyst/testOnly org.apache.spark.sql.catalyst.optimizer.OptimizeJsonExprsSuite"`: 24 tests passed, including nested sharing, both prefix-conflict directions, one-pass grouping of parallel prefix chains, a 2,000-path wide projection, unsupported paths, depth limits, and the default-off plan check.
- `build/sbt "sql/testOnly org.apache.spark.sql.JsonFunctionsSuite"`: 106 tests passed, including shared-vs-legacy equivalence for duplicate keys, nulls, malformed JSON, single-quoted JSON, non-object intermediate values, rendering failures, and whole-stage code generation.
- `build/sbt "hive/Test/testOnly org.apache.spark.sql.configaudit.SparkConfigBindingPolicySuite"`: 3 tests passed.
- `build/sbt "catalyst/scalastyle" "catalyst/Test/scalastyle" "sql/Test/scalastyle"`: passed with no findings.
- `git diff --check`.

The microbenchmark was run with:

`build/sbt "sql/Test/runMain org.apache.spark.sql.execution.datasources.json.SharedJsonParseBenchmark"`

Results for 200,000 cached rows with 32 fields under a nested `payload` object:

| Nested paths extracted | Feature flag off | Feature flag on | Speedup |
| ---: | ---: | ---: | ---: |
| 2 | 642 ms | 410 ms | 1.6x |
| 4 | 1,304 ms | 435 ms | 3.0x |
| 8 | 2,525 ms | 556 ms | 4.5x |
| 16 | 5,603 ms | 901 ms | 6.2x |

These are local expression-scaling measurements on JDK 17 and an Apple M5 Max, not end-to-end production-job results.
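
For readers who want a concrete picture of the path grouping described under "Path grouping examples", here is a minimal sketch in Python (not the Scala used in Spark, and not the code in this PR). The names `parse_named_path` and `group_prefix_free`, the regular expression, the first-fit placement into groups, and the treatment of identical paths as compatible are all assumptions made for illustration; only the 64-segment limit, the supported path forms, and the expected grouping come from the description above.

```python
import re
from typing import List, Optional, Tuple

MAX_DEPTH = 64  # deeper paths stay on independent evaluation, per the description

# One named segment: .name or ['name']. Hypothetical grammar for this sketch only.
_SEGMENT = re.compile(r"\.([^.\[\]*'\s]+)|\['([^']+)'\]")


def parse_named_path(path: str) -> Optional[Tuple[str, ...]]:
    """Return field-name segments, or None for forms this sketch does not share."""
    if not path.startswith("$"):
        return None
    pos, segments = 1, []
    while pos < len(path):
        m = _SEGMENT.match(path, pos)
        if m is None:
            return None  # wildcard, array subscript, or otherwise unsupported
        segments.append(m.group(1) if m.group(1) is not None else m.group(2))
        pos = m.end()
    if not segments or len(segments) > MAX_DEPTH:
        return None
    return tuple(segments)


class _Node:
    def __init__(self) -> None:
        self.children = {}     # dicts keep insertion order
        self.terminal = False  # True when a requested path ends here


def _fits(root: _Node, segments: Tuple[str, ...]) -> bool:
    """True if adding `segments` keeps the trie free of ancestor/descendant pairs."""
    node = root
    for seg in segments:
        if node.terminal:
            return False       # an ancestor of this path is already in the group
        nxt = node.children.get(seg)
        if nxt is None:
            return True        # new branch, so no conflict is possible
        node = nxt
    return not node.children   # children here would be descendants of this path


def _insert(root: _Node, segments: Tuple[str, ...]) -> None:
    node = root
    for seg in segments:
        node = node.children.setdefault(seg, _Node())
    node.terminal = True


def group_prefix_free(paths: List[Tuple[str, ...]]) -> List[List[Tuple[str, ...]]]:
    """Place each path, in request order, into the first group where it fits."""
    groups = []  # list of (trie root, member paths)
    for segments in paths:
        for root, members in groups:
            if _fits(root, segments):
                _insert(root, segments)
                if segments not in members:
                    members.append(segments)
                break
        else:
            root = _Node()
            _insert(root, segments)
            groups.append((root, [segments]))
    return [members for _, members in groups]


requested = [parse_named_path(p) for p in ["$.a", "$.a.b", "$.x", "$.x.y"]]
for i, group in enumerate(group_prefix_free(requested), start=1):
    print(f"group {i}:", ", ".join("$." + ".".join(p) for p in group))
```

Under these assumptions the script prints `group 1: $.a, $.x` and `group 2: $.a.b, $.x.y`, matching the example grouping. Each conflict check walks a single path through a trie instead of comparing it against every path already placed, which is the idea behind avoiding quadratic pairwise comparisons.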
