Chao Sun created SPARK-57626:
--------------------------------
Summary: Extend shared get_json_object parsing to nested named paths
Key: SPARK-57626
URL: https://issues.apache.org/jira/browse/SPARK-57626
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.3.0
Reporter: Chao Sun
Assignee: Chao Sun
SPARK-47670 introduced opt-in shared parsing for repeated get_json_object calls
over the same JSON input. Its implementation intentionally shares only simple
top-level fields, so repeated literal nested object paths still parse the input
independently.
For example:
{code:sql}
SELECT
get_json_object(json, '$.payload.user.id') AS user_id,
get_json_object(json, '$.payload.user.name') AS user_name,
get_json_object(json, '$.payload.request_id') AS request_id
FROM events
{code}
With spark.sql.optimizer.getJsonObjectSharedParsing.enabled=true, these named
paths, none of which is a prefix of another, should be extracted in a single
streaming scan of the input, with no query changes required.
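As a rough illustration of the idea (not Spark's implementation), the requested paths can be grouped into a trie so that one walk over the document serves all of them. The sketch below is Python and makes several simplifying assumptions: it walks an already parsed value from json.loads instead of a streaming tokenizer, it returns raw Python values instead of get_json_object's rendered strings, and the helper names build_trie and extract are hypothetical.

```python
import json

def build_trie(paths):
    """Group '$.a.b.c' style paths into a nested dict keyed by field name."""
    trie = {}
    for path in paths:
        node = trie
        for name in path[2:].split("."):  # drop the leading "$."
            node = node.setdefault(name, {})
        node[None] = path  # leaf marker: remember which path ends here
    return trie

def extract(value, trie, out):
    """Walk the value once, descending only into fields some path asks for."""
    for name, child in trie.items():
        if name is None:
            out[child] = value  # a requested path ends at this node
        elif isinstance(value, dict) and name in value:
            extract(value[name], child, out)
    return out

paths = ["$.payload.user.id", "$.payload.user.name", "$.payload.request_id"]
doc = json.loads(
    '{"payload": {"user": {"id": 7, "name": "ann"}, "request_id": "r-1"}}'
)
result = extract(doc, build_trie(paths), {})
```

Paths that are absent from the input simply do not appear in the result, which a caller could map to NULL. Malformed input, duplicate keys, and rendering are deliberately left out of this sketch.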
The follow-up should preserve the existing get_json_object behavior for
malformed input, duplicate keys, nulls, and rendering failures. An ancestor and
its descendant must not share the same parse, because each requested path needs
independent legacy semantics. Dynamic paths, wildcards, array subscripts, and
excessively deep paths should continue using the existing evaluation.
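The eligibility rules above could be checked along these lines. This is a minimal Python sketch under stated assumptions, not the proposed optimizer rule: the depth limit MAX_DEPTH, the simplified field-name grammar, and the helper names are all hypothetical, and a dynamic (non-literal) path is modeled as None.

```python
import re

MAX_DEPTH = 8  # hypothetical limit; the issue does not state the real one
# Simplified assumption: a named path is "$" followed by dotted identifiers.
NAMED_PATH = re.compile(r"^\$(\.[A-Za-z_][A-Za-z0-9_]*)+$")

def segments(path):
    """Return the field names of a literal named path, or None if unsupported."""
    if not isinstance(path, str) or not NAMED_PATH.match(path):
        return None  # dynamic path, wildcard, or array subscript
    names = path[2:].split(".")
    return names if len(names) <= MAX_DEPTH else None

def is_shareable_group(paths):
    """True if every path is a supported named path and none is an ancestor of another."""
    parsed = [segments(p) for p in paths]
    if any(s is None for s in parsed):
        return False
    for i, a in enumerate(parsed):
        for j, b in enumerate(parsed):
            # a is a strict prefix of b: ancestor and descendant must not share a parse
            if i != j and len(a) < len(b) and b[:len(a)] == a:
                return False
    return True
```

For the query above, is_shareable_group returns True, while a set containing both '$.payload.user' and '$.payload.user.id' is rejected. How a real rule would split a mixed set into shared and fallback expressions is not specified in this issue, so the sketch only answers yes or no for a whole group.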
This is distinct from SPARK-53764, which collapses nested get_json_object
function calls rather than sharing sibling paths over the same input.