Re: [PR] [SPARK-57626][SQL] Share repeated nested JSON path parsing [spark]

via GitHub Mon, 22 Jun 2026 22:47:59 -0700


cloud-fan commented on code in PR #56685:
URL: https://github.com/apache/spark/pull/56685#discussion_r3457325485



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala:
##########
@@ -189,10 +194,18 @@ case class MultiGetJsonObject(
 
   final override val nodePatterns: Seq[TreePattern] = Seq(GET_JSON_OBJECT)
 
+  @transient
+  private lazy val namedPaths = fallbackPaths.map { path =>

Review Comment:
   Now that `namedPaths` is re-derived here from `fallbackPaths`, `fieldNames` 
has become vestigial: its content is never read anymore — only its 
`.length`/`.indices` is used (struct arity in `dataType`, and sizing of 
`nullRow`/`values`/`matched` in the evaluator). Before this PR the leaf names 
drove extraction via `fieldToOrdinal`; now both the trie and 
`topLevelFieldToOrdinal` key off `namedPaths` instead. You correctly dropped 
the old `fieldNames.distinct` requirement (nested paths can legitimately share 
a leaf), and the optimizer test even asserts `fieldNames == Seq("b", "c.d", 
"e", "b")` with a duplicate — which confirms the field no longer carries 
identity.
   
   Consider dropping `fieldNames` from both `MultiGetJsonObject` and 
`MultiGetJsonObjectEvaluator` and deriving the arity from `fallbackPaths`. It 
removes a third overlapping representation that must be kept in sync and avoids 
misleading a future reader into thinking the leaf names matter (e.g. for output 
naming, which actually uses `_$index`). Non-blocking.



##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/SharedJsonParseBenchmark.scala:
##########
@@ -75,6 +75,40 @@ object SharedJsonParseBenchmark extends SqlBasedBenchmark {
       }
 
       data.unpersist()
+
+      val nestedData = spark.range(0, rows, 1, 4)
+        .select(to_json(struct(struct(Seq.tabulate(fieldCount) { index =>
+          fieldValue.as(s"field_$index")
+        }: _*).as("payload"))).as("json"))
+        .cache()
+      nestedData.count()
+
+      Seq(2, 4, 8, 16).foreach { selectedFieldCount =>
+        val pathBenchmark = new Benchmark(
+          s"get_json_object extracting $selectedFieldCount of $fieldCount 
nested fields",

Review Comment:
   These new nested-path cases aren't reflected in the committed result files: 
`sql/core/benchmarks/SharedJsonParseBenchmark-results.txt` (and the 
`-jdk21-`/`-jdk25-` variants) were last generated in SPARK-47670 and still 
contain only the top-level `extracting N of 32 fields` rows. Per Spark 
convention, regenerate and commit them so the committed results match the 
benchmark, or confirm they'll be regenerated before merge. Non-blocking.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-57626][SQL] Share repeated nested JSON path parsing [spark]

Reply via email to