vivek807 commented on issue #17440:
URL: https://github.com/apache/druid/issues/17440#issuecomment-2465963879

   > Oops, just noticed this issue, sorry for the delayed response.
   > 
   > fwiw a better way to write this query is
   > 
   > ```
   > SELECT
   >   ARRAY_CONTAINS(JSON_VALUE(agent, '$.type' RETURNING VARCHAR ARRAY), 
'Browser')
   > ...
   > FROM "kttm_nested_1"
   > ```
   > 
   > since `JSON_VALUE` expressions can be optimized quite significantly* 
compared to `JSON_QUERY`/`JSON_QUERY_ARRAY` which process the raw json instead 
of the specialized nested field columns that are created when ingesting json 
columns. The * is because this is slightly dependent on which Druid version 
created the json column, since in 28+ nested arrays of primitive types are 
stored much more optimally then in earlier versions of Druid, so if the 
segments were created with an older version then json_value would also fall 
back to processing the raw data to construct the array.
   > 
   > Also a side-note, the `ARRAY_` functions should be preferred over the 
`MV_` functions when interacting with actual array types such as stored in json 
columns and array columns. the `MV_` functions are primarily for use with 
Druids older non-standard multi-value string columns, which present themselves 
as `VARCHAR` in the SQL layer and are sort of distinct from actual arrays, so 
mixing these up can sometimes result in strange type inference in the SQL layer 
when composing more complicated expressions.
   > 
   > Anyway, the reason the query is failing in this form is because of an 
optimization for `ARRAY_CONTAINS` that maybe shouldn't happen when used with 
`JSON_QUERY_ARRAY`, or rather when the native `array_contains` expression 
(which backs both `ARRAY_CONTAINS` and `MV_CONTAINS` in SQL) encounters a 
complex type. `JSON_QUERY_ARRAY` doesn't really know the type of the element it 
is extracting, so it is handled as `COMPLEX<json>`, however the specialization 
check for primitive arrays is trying to cast the rhs argument to match the 
element type of the array it will be checking in the case it is a literal.
   > 
   > This can be fixed by modifying 
https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/math/expr/Function.java#L3977
 to instead be something like
   > 
   > ```
   >         if (lhsType == null || !(lhsType.isPrimitive() || 
lhsType.isPrimitiveArray())) {
   >           return this;
   >         }
   > ```
   > 
   > where it fixes the issue because it bails out of the optimization path 
early and so falls back to the per row checking of type information to decide 
how to processing the function.
   > 
   > @vivek807 if you'd like to make a PR with that change and add a test for 
this to 
https://github.com/apache/druid/blob/master/processing/src/test/java/org/apache/druid/query/expression/NestedDataExpressionsTest.java#L394
 I would be happy to approve it, else I can make a PR with a fix later this 
week. We also need to do `array_overlap(json_query_array(...))` too since it 
has a similar optimization path and will need the same fix 
https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/math/expr/Function.java#L3724).
   
   Thanks, @clintropolis, for the detailed explanation. I’ve implemented the 
fix and updated the unit tests as well. Please review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to