cloud-fan commented on code in PR #56685:
URL: https://github.com/apache/spark/pull/56685#discussion_r3457325485
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala:
##########
@@ -189,10 +194,18 @@ case class MultiGetJsonObject(
final override val nodePatterns: Seq[TreePattern] = Seq(GET_JSON_OBJECT)
+ @transient
+ private lazy val namedPaths = fallbackPaths.map { path =>
Review Comment:
Now that `namedPaths` is re-derived here from `fallbackPaths`, `fieldNames`
has become vestigial: its content is never read anymore — only its
`.length`/`.indices` is used (struct arity in `dataType`, and sizing of
`nullRow`/`values`/`matched` in the evaluator). Before this PR the leaf names
drove extraction via `fieldToOrdinal`; now both the trie and
`topLevelFieldToOrdinal` key off `namedPaths` instead. You correctly dropped
the old `fieldNames.distinct` requirement (nested paths can legitimately share
a leaf), and the optimizer test even asserts `fieldNames == Seq("b", "c.d",
"e", "b")` with a duplicate — which confirms the field no longer carries
identity.
Consider dropping `fieldNames` from both `MultiGetJsonObject` and
`MultiGetJsonObjectEvaluator` and deriving the arity from `fallbackPaths`. It
removes a third overlapping representation that must be kept in sync and avoids
misleading a future reader into thinking the leaf names matter (e.g. for output
naming, which actually uses `_$index`). Non-blocking.
##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/SharedJsonParseBenchmark.scala:
##########
@@ -75,6 +75,40 @@ object SharedJsonParseBenchmark extends SqlBasedBenchmark {
}
data.unpersist()
+
+ val nestedData = spark.range(0, rows, 1, 4)
+ .select(to_json(struct(struct(Seq.tabulate(fieldCount) { index =>
+ fieldValue.as(s"field_$index")
+ }: _*).as("payload"))).as("json"))
+ .cache()
+ nestedData.count()
+
+ Seq(2, 4, 8, 16).foreach { selectedFieldCount =>
+ val pathBenchmark = new Benchmark(
+ s"get_json_object extracting $selectedFieldCount of $fieldCount
nested fields",
Review Comment:
These new nested-path cases aren't reflected in the committed result files:
`sql/core/benchmarks/SharedJsonParseBenchmark-results.txt` (and the
`-jdk21-`/`-jdk25-` variants) were last generated in SPARK-47670 and still
contain only the top-level `extracting N of 32 fields` rows. Per Spark
convention, regenerate and commit them so the committed results match the
benchmark, or confirm they'll be regenerated before merge. Non-blocking.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]