sunchao opened a new pull request, #56685:
URL: https://github.com/apache/spark/pull/56685

   ## Why are the changes needed?
   
   [SPARK-47670](https://issues.apache.org/jira/browse/SPARK-47670) and #56547 
added opt-in shared parsing for repeated `get_json_object` calls over the same 
input. The initial implementation intentionally shares only simple top-level 
fields, so repeated calls with literal nested object paths still each parse the 
same JSON independently.
   
   For example, consider an `events` table whose `json` column contains:
   
   ```json
   {"payload":{"user":{"id":123,"name":"alice"},"request_id":"r-1"}}
   ```
   
   An existing query might run:
   
   ```sql
   SELECT
     get_json_object(json, '$.payload.user.id') AS user_id,
     get_json_object(json, '$.payload.user.name') AS user_name,
     get_json_object(json, '$.payload.request_id') AS request_id
   FROM events;
   ```
   
   Before this PR, Spark parses each row's `json` value three times, once per 
nested extraction. With shared parsing enabled, Catalyst rewrites the three 
calls to one internal `MultiGetJsonObject`, so the input is scanned once and all 
three values are returned from that scan. The SQL does not change.
   
   This follow-up preserves the existing `get_json_object` behavior for 
malformed input, duplicate keys, nulls, non-object intermediate values, and 
rendering failures. It also preserves existing plans and execution behavior 
while `spark.sql.optimizer.getJsonObjectSharedParsing.enabled` is off, which 
remains the default.
   
   ## What changes were proposed in this PR?
   
   This PR extends the existing internal `MultiGetJsonObject` expression and 
Catalyst rewrite:
   
   - Parse literal named JSON paths into field-name segments and group repeated 
extractions over the same input. Dot notation such as `$.payload.user.id` and 
quoted names such as `$['payload']['user.name']` are supported.
   - Build a runtime path trie so multiple nested fields can be extracted in 
one streaming JSON scan.
   - Preserve the first non-null duplicate-key match independently for each 
requested path, including duplicate parent objects and non-object intermediate 
values.
   - Never place an ancestor path and its descendant in the same shared parse.
   - Select paths with insertion-ordered prefix tries, preserving 
first-path-wins behavior without quadratic pairwise comparisons.
   - Build all greedy prefix-free groups in one optimizer invocation, avoiding 
repeated fixed-point passes for parallel prefix chains.
   - Keep the existing top-level fast path, so top-level sharing does not pay 
for runtime trie traversal.
   - Leave dynamic paths, wildcards, array subscripts, and paths deeper than 64 
named fields on their existing independent `GetJsonObject` evaluation.
   - Continue to use the existing default-off feature flag. With the flag off, 
the sharing rewrite is not invoked and existing plans remain unchanged.
   - Extend the existing microbenchmark with repeated paths under a nested 
`payload` object.
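
   As an illustration of the first and eighth bullets, here is a minimal Python sketch of splitting a literal path into field-name segments and rejecting the forms that stay on independent evaluation. It is not the Catalyst code: the function name `parse_named_path`, the regular expression, and the choice to return `None` for the bare root path `$` are assumptions made for this sketch. Only the 64-field limit and the supported and unsupported path forms come from the description above.

```python
import re
from typing import List, Optional

MAX_DEPTH = 64  # depth limit stated in this PR description

# One path step: either .name or ['name'] (hypothetical grammar for this sketch).
_STEP = re.compile(r"\.([^.\[\]*']+)|\['([^']+)'\]")


def parse_named_path(path: str) -> Optional[List[str]]:
    """Return field-name segments, or None if the path is not eligible for sharing."""
    if not path.startswith("$"):
        return None
    pos, segments = 1, []
    while pos < len(path):
        m = _STEP.match(path, pos)
        if m is None:
            return None  # wildcard, array subscript, or any other unsupported step
        name = m.group(1) if m.group(1) is not None else m.group(2)
        if name == "*":
            return None  # treat ['*'] as a wildcard (assumption of this sketch)
        segments.append(name)
        pos = m.end()
    if not segments or len(segments) > MAX_DEPTH:
        return None  # bare "$" or too deep: leave on independent evaluation
    return segments
```

   With this sketch, `parse_named_path("$['payload']['user.name']")` yields the two segments `payload` and `user.name`, while `$.items[0].id` and `$.payload.*` yield `None`.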
   
   There is no new user-facing expression and no query migration is required. 
Users opt in with the existing configuration:
   
   ```sql
   SET spark.sql.optimizer.getJsonObjectSharedParsing.enabled = true;
   ```
   
   ### Path grouping examples
   
   These sibling paths are prefix-free and share one parse:
   
   ```text
   $.payload.user.id
   $.payload.user.name
   $.payload.request_id
   ```
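
   The idea of serving several prefix-free paths from one pass can be sketched in Python as follows. This is an illustration under stated assumptions, not the `MultiGetJsonObject` implementation: it parses the whole document with the standard `json` module instead of a streaming scan, it returns Python values rather than the rendered strings `get_json_object` produces, and the names `Obj`, `Node`, `build_trie`, and `extract_all` are hypothetical. Objects are kept as ordered key/value pairs so that the first non-null duplicate-key match wins for each path, which is how this sketch reads the rule in the bullet list above.

```python
import json


class Obj(list):
    """A JSON object kept as ordered (key, value) pairs so duplicate keys survive."""


class Node:
    def __init__(self):
        self.children = {}  # field name -> Node
        self.output = None  # result index if a requested path ends here


def build_trie(paths):
    root = Node()
    for index, segments in enumerate(paths):
        node = root
        for name in segments:
            node = node.children.setdefault(name, Node())
        node.output = index
    return root


def extract_all(text, paths):
    """Extract every path in `paths` (prefix-free segment lists) in one walk."""
    results = [None] * len(paths)
    try:
        doc = json.loads(text, object_pairs_hook=Obj)
    except ValueError:
        return results  # malformed input: every path yields None in this sketch
    root = build_trie(paths)

    def walk(value, node):
        if not isinstance(value, Obj):
            return  # non-object intermediate value: nothing below it can match
        for key, child_value in value:  # document order, duplicates included
            child = node.children.get(key)
            if child is None:
                continue  # field not requested by any path
            if child.output is not None:
                # Keep the first non-null match for this path.
                if results[child.output] is None and child_value is not None:
                    results[child.output] = child_value
            else:
                walk(child_value, child)

    walk(doc, root)
    return results
```

   For the example row shown earlier and the three sibling paths, `extract_all` returns `[123, "alice", "r-1"]` from a single walk over the document.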
   
   Ancestor and descendant paths cannot share the same parse. For this ordered 
request:
   
   ```text
   $.a
   $.a.b
   $.x
   $.x.y
   ```
   
   the optimizer builds both groups in one invocation:
   
   ```text
   group 1: $.a,   $.x
   group 2: $.a.b, $.x.y
   ```
   
   `$.a` never shares a parse with `$.a.b`, and `$.x` never shares one with 
`$.x.y`, so each path retains its independent legacy semantics.
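
   The grouping above can be reproduced with a small Python sketch. It is one possible greedy strategy, assumed for illustration: each path, in request order, joins the first group whose prefix trie holds neither an ancestor nor a descendant of it, and otherwise starts a new group. The names `TrieNode`, `conflicts`, `insert`, and `group_prefix_free` are hypothetical, and the sketch assumes the requested paths are distinct and non-empty. The optimizer rule in this PR may organize the work differently.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # field name -> TrieNode
        self.terminal = False  # True if a selected path ends at this node


def conflicts(root, segments):
    """True if the trie already holds this path, one of its ancestors, or a descendant."""
    node = root
    for name in segments:
        if node.terminal:
            return True  # an ancestor of this path is already in the group
        node = node.children.get(name)
        if node is None:
            return False  # diverged from every stored path: no conflict
    return True  # walked the whole path: same path or a descendant is stored


def insert(root, segments):
    node = root
    for name in segments:
        node = node.children.setdefault(name, TrieNode())
    node.terminal = True


def group_prefix_free(paths):
    """Assign each path, in order, to the first group it does not conflict with."""
    groups = []  # list of (trie root, member paths)
    for segments in paths:
        for root, members in groups:
            if not conflicts(root, segments):
                insert(root, segments)
                members.append(segments)
                break
        else:
            root = TrieNode()
            insert(root, segments)
            groups.append((root, [segments]))
    return [members for _, members in groups]
```

   Calling `group_prefix_free([["a"], ["a", "b"], ["x"], ["x", "y"]])` returns `[[["a"], ["x"]], [["a", "b"], ["x", "y"]]]`, which matches group 1 and group 2 above. Each conflict check walks one trie along the path's own segments, so it does not compare the path against every other path individually.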
   
   Unsupported path forms remain unchanged. For example, `$.items[0].id`, 
`$.payload.*`, and a path supplied by another column continue to use the 
existing `GetJsonObject` evaluation.
   
   ## How was this PR tested?
   
   - `build/sbt "catalyst/compile" "catalyst/Test/compile" "sql/Test/compile"` 
— passed on JDK 17.
   - `build/sbt "catalyst/testOnly 
org.apache.spark.sql.catalyst.optimizer.OptimizeJsonExprsSuite"` — 24 tests 
passed, including nested sharing, both prefix-conflict directions, one-pass 
grouping of parallel prefix chains, a 2,000-path wide projection, unsupported 
paths, depth limits, and the default-off plan check.
   - `build/sbt "sql/testOnly org.apache.spark.sql.JsonFunctionsSuite"` — 106 
tests passed, including shared-vs-legacy equivalence for duplicate keys, nulls, 
malformed JSON, single-quoted JSON, non-object intermediate values, rendering 
failures, and whole-stage code generation.
   - `build/sbt "hive/Test/testOnly 
org.apache.spark.sql.configaudit.SparkConfigBindingPolicySuite"` — 3 tests 
passed.
   - `build/sbt "catalyst/scalastyle" "catalyst/Test/scalastyle" 
"sql/Test/scalastyle"` — passed with no findings.
   - `git diff --check`.
   
   The microbenchmark was run with:
   
   `build/sbt "sql/Test/runMain 
org.apache.spark.sql.execution.datasources.json.SharedJsonParseBenchmark"`
   
   Results for 200,000 cached rows with 32 fields under a nested `payload` 
object:
   
   | Nested paths extracted | Feature flag off | Feature flag on | Speedup |
   | ---: | ---: | ---: | ---: |
   | 2 | 642 ms | 410 ms | 1.6x |
   | 4 | 1,304 ms | 435 ms | 3.0x |
   | 8 | 2,525 ms | 556 ms | 4.5x |
   | 16 | 5,603 ms | 901 ms | 6.2x |
   
   These are local expression-scaling measurements on JDK 17 and an Apple M5 
Max, not end-to-end production-job results.
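
   The Speedup column is the flag-off time divided by the flag-on time, rounded to one decimal place. A short Python check of that arithmetic, using only the numbers in the table (the helper name `speedup` is just for this sketch):

```python
# (nested paths extracted, flag-off ms, flag-on ms), copied from the table above
rows = [(2, 642, 410), (4, 1304, 435), (8, 2525, 556), (16, 5603, 901)]


def speedup(off_ms, on_ms):
    """Ratio of flag-off to flag-on time, rounded to one decimal place."""
    return round(off_ms / on_ms, 1)


for n_paths, off_ms, on_ms in rows:
    print(f"{n_paths} paths: {speedup(off_ms, on_ms)}x")
```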
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

