Robert Joseph Evans created SPARK-46778:
-------------------------------------------

             Summary: get_json_object flattens wildcard queries that match a single value
                 Key: SPARK-46778
                 URL: https://issues.apache.org/jira/browse/SPARK-46778
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.4.1
            Reporter: Robert Joseph Evans


I think this impacts all versions of {{get_json_object}}, but I am not 100% 
sure.

The unit test for 
[$.store.book[*].reader|https://github.com/apache/spark/blob/39f8e1a5953b5897f893151d24dc585a80c0c8a0/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala#L142-L146]
 verifies that the output of this query is a single-level JSON array, but when 
I put the same JSON and JSON path into [http://jsonpath.com/] I get a result 
with multiple levels of nesting. It looks like Apache Spark flattens the 
result of a {{[*]}} match when only a single element matches.
{code:java}
scala> Seq("""[{"a":"A"},{"b":"B"}]""","""[{"a":"A"},{"a":"B"}]""").toDF("jsonStr").selectExpr("""get_json_object(jsonStr,"$[*].a")""").show(false)
+--------------------------------+
|get_json_object(jsonStr, $[*].a)|
+--------------------------------+
|"A"                             |
|["A","B"]                       |
+--------------------------------+ {code}
This is a problem because the returned schema is no longer consistent, even 
when the input schema is known to be consistent. For example, if I wanted to 
know how many elements matched this query, I could wrap it in 
{{json_array_length}}, but that does not work in the general case: a single 
match comes back as a bare value rather than an array, so the length is null.
{code:java}
scala> Seq("""[{"a":"A"},{"b":"B"}]""","""[{"a":"A"},{"a":"B"}]""").toDF("jsonStr").selectExpr("""json_array_length(get_json_object(jsonStr,"$[*].a"))""").show(false)
+---------------------------------------------------+
|json_array_length(get_json_object(jsonStr, $[*].a))|
+---------------------------------------------------+
|null                                               |
|2                                                  |
+---------------------------------------------------+ {code}
If the single matched value is itself a JSON array, I do get a number, but it 
is wrong: it is the length of the matched array, not the number of matches.
{code:java}
scala> Seq("""[{"a":[1,2,3,4,5]},{"b":"B"}]""","""[{"a":[1,2,3,4,5]},{"a":[1,2,3,4,5]}]""").toDF("jsonStr").selectExpr("""json_array_length(get_json_object(jsonStr,"$[*].a"))""").show(false)
+---------------------------------------------------+
|json_array_length(get_json_object(jsonStr, $[*].a))|
+---------------------------------------------------+
|5                                                  |
|2                                                  |
+---------------------------------------------------+ {code}
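To make the suspected rule concrete, here is a minimal pure-Scala sketch of the behavior seen in the three examples above. It is a model of the observed output only, not Spark's actual implementation; the object name {{WildcardFlattenModel}} and both function names are hypothetical, and returning {{None}} when nothing matches is an assumption.

```scala
// Hypothetical model of the observed behavior; NOT Spark's real code.
object WildcardFlattenModel {

  // `matches` holds the raw JSON text of each value matched by the [*] step.
  // Models what get_json_object appears to return in the examples above.
  def observed(matches: Seq[String]): Option[String] = matches match {
    case Seq()       => None                                 // assumed: no match gives NULL
    case Seq(single) => Some(single)                         // one match: array wrapper dropped
    case many        => Some(many.mkString("[", ",", "]"))   // several matches: JSON array
  }

  // A schema-consistent alternative: wrap every non-empty result in an array.
  def alwaysArray(matches: Seq[String]): Option[String] =
    if (matches.isEmpty) None else Some(matches.mkString("[", ",", "]"))
}
```

With this model, {{observed(Seq("[1,2,3,4,5]"))}} yields {{[1,2,3,4,5]}}, which {{json_array_length}} counts as 5, while {{alwaysArray}} yields {{[[1,2,3,4,5]]}}, whose length is 1, the actual number of matches.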


