Robert Joseph Evans created SPARK-46778:
-------------------------------------------
Summary: get_json_object flattens wildcard queries that match a
single value
Key: SPARK-46778
URL: https://issues.apache.org/jira/browse/SPARK-46778
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.4.1
Reporter: Robert Joseph Evans
I think this impacts all versions of {{get_json_object}}, but I am not 100%
sure.
The unit test for
[$.store.book[*].reader|https://github.com/apache/spark/blob/39f8e1a5953b5897f893151d24dc585a80c0c8a0/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala#L142-L146]
verifies that the output of this query is a single-level JSON array, but when
I put the same JSON and JSON path into [http://jsonpath.com/] I get a result
with multiple levels of nesting. It looks like Apache Spark tries to flatten
lists for {{[*]}} matches when there is only a single element that matches.
{code:java}
scala> Seq("""[{"a":"A"},{"b":"B"}]""","""[{"a":"A"},{"a":"B"}]""").toDF("jsonStr").selectExpr("""get_json_object(jsonStr,"$[*].a")""").show(false)
+--------------------------------+
|get_json_object(jsonStr, $[*].a)|
+--------------------------------+
|"A"                             |
|["A","B"]                       |
+--------------------------------+
{code}
This is a problem because the returned schema is no longer consistent, even
when the input schema is known to be consistent. For example, if I wanted to
know how many elements matched this query, I could wrap it in
{{json_array_length}}, but that does not work in the general case.
{code:java}
scala> Seq("""[{"a":"A"},{"b":"B"}]""","""[{"a":"A"},{"a":"B"}]""").toDF("jsonStr").selectExpr("""json_array_length(get_json_object(jsonStr,"$[*].a"))""").show(false)
+---------------------------------------------------+
|json_array_length(get_json_object(jsonStr, $[*].a))|
+---------------------------------------------------+
|null                                               |
|2                                                  |
+---------------------------------------------------+
{code}
If the single matched value is itself a JSON array, then I do get a number, but
it is wrong: it is the length of the matched value, not the number of matches.
{code:java}
scala> Seq("""[{"a":[1,2,3,4,5]},{"b":"B"}]""","""[{"a":[1,2,3,4,5]},{"a":[1,2,3,4,5]}]""").toDF("jsonStr").selectExpr("""json_array_length(get_json_object(jsonStr,"$[*].a"))""").show(false)
+---------------------------------------------------+
|json_array_length(get_json_object(jsonStr, $[*].a))|
+---------------------------------------------------+
|5                                                  |
|2                                                  |
+---------------------------------------------------+
{code}
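To make the inconsistency concrete, here is a minimal Spark-free sketch in Scala that models the two possible wildcard semantics on the inputs above. It is a toy model that only reproduces the results shown in this report, not Spark's actual implementation. All names in it ({{WildcardModel}}, {{flattening}}, {{alwaysArray}}, {{arrayLength}}) are hypothetical, and the "always wrap" rule is just my assumption of what a consistent result could look like.

```scala
object WildcardModel {
  // Toy JSON model: just enough to mirror the examples in this report.
  sealed trait Json
  case class JStr(value: String) extends Json
  case class JNum(value: Int) extends Json
  case class JArr(items: List[Json]) extends Json
  case class JObj(fields: Map[String, Json]) extends Json

  // Collect every match of `$[*].<field>` against a top-level array.
  def matches(doc: Json, field: String): List[Json] = doc match {
    case JArr(items) =>
      items.collect { case JObj(fs) if fs.contains(field) => fs(field) }
    case _ => Nil
  }

  // Behaviour observed in this report: a single match is returned unwrapped.
  def flattening(doc: Json, field: String): Option[Json] =
    matches(doc, field) match {
      case Nil           => None
      case single :: Nil => Some(single)
      case many          => Some(JArr(many))
    }

  // Assumed alternative: every non-empty match list is wrapped in an array.
  def alwaysArray(doc: Json, field: String): Option[Json] =
    matches(doc, field) match {
      case Nil => None
      case ms  => Some(JArr(ms))
    }

  // Analogue of json_array_length: defined only when the value is an array.
  def arrayLength(j: Option[Json]): Option[Int] =
    j.collect { case JArr(xs) => xs.length }

  // The three input shapes used in the examples above.
  val fiveInts: Json = JArr(List(1, 2, 3, 4, 5).map(n => JNum(n)))
  val oneMatch: Json =
    JArr(List(JObj(Map("a" -> JStr("A"))), JObj(Map("b" -> JStr("B")))))
  val twoMatches: Json =
    JArr(List(JObj(Map("a" -> JStr("A"))), JObj(Map("a" -> JStr("B")))))
  val oneArrayMatch: Json =
    JArr(List(JObj(Map("a" -> fiveInts)), JObj(Map("b" -> JStr("B")))))
}
```

In this model, {{arrayLength(flattening(doc, "a"))}} gives None, Some(2) and Some(5) for {{oneMatch}}, {{twoMatches}} and {{oneArrayMatch}}, which lines up with the null, 2 and 5 shown above. With {{alwaysArray}} the same calls give Some(1), Some(2) and Some(1), so the length always equals the number of matches.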