sunchao opened a new pull request, #56547:
URL: https://github.com/apache/spark/pull/56547

   ### Why are the changes needed?
   
   Each `get_json_object` expression currently parses its JSON input 
independently. A projection that extracts several fields from the same JSON 
string therefore repeats the scan and parse work for every path, so its CPU 
cost grows roughly with the number of extracted fields.
   
   [SPARK-47670](https://issues.apache.org/jira/browse/SPARK-47670) proposes 
sharing that parsing work. This PR implements a conservative first scope for 
deterministic, simple top-level paths. Nested paths, wildcards, array 
subscripts, computed JSON inputs, and evaluation-order-sensitive expression 
shapes remain unchanged so that Hive-compatible behavior is preserved.
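
The cost difference described above can be illustrated with a minimal sketch. This is not Spark code: it uses Python's standard `json` module in place of Jackson, returns Python values rather than the strings `get_json_object` produces, and the names `CountingParser`, `extract_independently`, and `extract_shared` are hypothetical helpers introduced only for this illustration.

```python
import json


class CountingParser:
    """Wraps json.loads and counts how many times an input is parsed."""

    def __init__(self):
        self.calls = 0

    def parse(self, text):
        self.calls += 1
        try:
            return json.loads(text)
        except (TypeError, ValueError):
            # Null or malformed input: no document is available.
            return None


def extract_independently(parser, text, fields):
    """One parse per requested field, like separate per-path expressions."""
    out = []
    for field in fields:
        doc = parser.parse(text)
        out.append(doc.get(field) if isinstance(doc, dict) else None)
    return tuple(out)


def extract_shared(parser, text, fields):
    """One parse for all requested top-level fields, returned together."""
    doc = parser.parse(text)
    if not isinstance(doc, dict):
        return tuple(None for _ in fields)
    return tuple(doc.get(field) for field in fields)
```

In this sketch, extracting three fields from one row calls the parser three times on the independent path and once on the shared path, while both return the same tuple.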
   
   ### What changes were proposed in this PR?
   
   * Add an internal `MultiGetJsonObject` expression and evaluator that 
extracts several top-level fields with one Jackson parse and returns them as a 
struct.
   * Extend `OptimizeCsvJsonExprs` to group eligible sibling `get_json_object` 
expressions over the same input attribute and rewrite them to fields of one 
shared parse.
   * Preserve legacy behavior for malformed JSON, null and duplicate keys, 
generator-side rendering failures, and parser-side failures. The shared 
evaluator falls back to the existing per-path evaluator when sharing cannot 
safely preserve independent results.
   * Keep nested paths, wildcards, array subscripts, conditional or guarded 
evaluation, unsupported arithmetic order, and separately projected `from_json` 
expressions out of the rewrite.
   * Gate the optimization behind the internal, default-disabled 
`spark.sql.optimizer.getJsonObjectSharedParsing.enabled` configuration. The 
existing JSON expression optimization must also be enabled.
   * Add optimizer, runtime, code-generation, malformed-input, deep-nesting, 
and microbenchmark coverage.
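
The grouping step in the list above could look roughly like the following sketch. It is an illustration under stated assumptions, not the logic in `OptimizeCsvJsonExprs`: the regular expression is an assumed definition of a "simple top-level path", the threshold of two calls per input is an assumption, and `top_level_field` and `plan_shared_groups` are hypothetical names. It also ignores the evaluation-order and determinism checks the PR describes.

```python
import re
from collections import defaultdict

# Assumed shape of a simple top-level path: "$." followed by one plain name.
SIMPLE_TOP_LEVEL = re.compile(r"\$\.([A-Za-z_][A-Za-z0-9_]*)")


def top_level_field(path):
    """Return the field name for a simple top-level path, else None."""
    match = SIMPLE_TOP_LEVEL.fullmatch(path)
    return match.group(1) if match else None


def plan_shared_groups(calls):
    """Split (input_column, path) calls into shared groups and untouched calls.

    A column gets a shared group only when at least two of its calls are
    eligible; everything else is left for the existing per-path evaluation.
    """
    grouped = defaultdict(list)
    untouched = []
    for column, path in calls:
        field = top_level_field(path)
        if field is None:
            untouched.append((column, path))  # nested, wildcard, subscript
        else:
            grouped[column].append(field)

    shared = {}
    for column, fields in grouped.items():
        if len(fields) >= 2:
            shared[column] = fields
        else:
            untouched.extend((column, "$." + f) for f in fields)
    return shared, untouched
```

For example, calls for `$.a`, `$.b`, `$.c.d`, and `$.e[0]` on the same column would place `a` and `b` in one shared group and leave the nested and subscripted paths untouched.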
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. It adds an internal, default-disabled configuration. With the flag 
disabled, existing analyzed and optimized plans are unchanged. When enabled, 
eligible repeated top-level `get_json_object` calls use a shared parse; their 
result semantics remain unchanged.
   
   ### How was this PR tested?
   
   The following suites and style checks passed on JDK 17:
   
   ```
   build/sbt "catalyst/testOnly org.apache.spark.sql.catalyst.optimizer.OptimizeJsonExprsSuite"
   build/sbt "sql/testOnly org.apache.spark.sql.JsonFunctionsSuite"
   build/sbt "catalyst/scalastyle; catalyst/Test/scalastyle; sql/scalastyle; sql/Test/scalastyle"
   ```
   
   `OptimizeJsonExprsSuite` passed all 20 tests, and `JsonFunctionsSuite` 
passed all 104 tests.
   
   The added benchmark was generated with:
   
   ```
   SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain org.apache.spark.sql.execution.datasources.json.SharedJsonParseBenchmark"
   ```
   
   Best times for 200,000 rows containing 32 JSON fields were:
   
   | Extracted paths | Shared parsing off | Shared parsing on | Relative speedup |
   |---:|---:|---:|---:|
   | 2 | 740 ms | 390 ms | 1.9x |
   | 4 | 1,387 ms | 480 ms | 2.9x |
   | 8 | 3,154 ms | 727 ms | 4.3x |
   | 16 | 5,109 ms | 709 ms | 7.2x |
   
   The benchmark result file records the full JDK 17 environment and output.
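
The relative speedup column is the ratio of the two timings in each row. The short check below recomputes it from the figures in the table; the variable names are illustrative only.

```python
# (shared parsing off, shared parsing on) best times in ms, keyed by the
# number of extracted paths, copied from the table above.
best_times_ms = {
    2: (740, 390),
    4: (1387, 480),
    8: (3154, 727),
    16: (5109, 709),
}

# Relative speedup = time with shared parsing off / time with it on.
speedups = {
    paths: round(off_ms / on_ms, 1)
    for paths, (off_ms, on_ms) in best_times_ms.items()
}

# Cost per extracted path with shared parsing off, to show that the
# baseline scales roughly with the number of paths.
per_path_off_ms = {
    paths: off_ms / paths for paths, (off_ms, _) in best_times_ms.items()
}

for paths in best_times_ms:
    print(paths, speedups[paths], round(per_path_off_ms[paths], 1))
```

With shared parsing off, the time per extracted path stays in the same range across all four rows, which matches the claim that the baseline cost grows roughly with the number of extracted fields.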
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: OpenAI Codex (GPT-5)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

