vivek807 commented on issue #17440: URL: https://github.com/apache/druid/issues/17440#issuecomment-2465963879
> Oops, just noticed this issue, sorry for the delayed response. > > fwiw a better way to write this query is > > ``` > SELECT > ARRAY_CONTAINS(JSON_VALUE(agent, '$.type' RETURNING VARCHAR ARRAY), 'Browser') > ... > FROM "kttm_nested_1" > ``` > > since `JSON_VALUE` expressions can be optimized quite significantly* compared to `JSON_QUERY`/`JSON_QUERY_ARRAY` which process the raw json instead of the specialized nested field columns that are created when ingesting json columns. The * is because this is slightly dependent on which Druid version created the json column, since in 28+ nested arrays of primitive types are stored much more optimally then in earlier versions of Druid, so if the segments were created with an older version then json_value would also fall back to processing the raw data to construct the array. > > Also a side-note, the `ARRAY_` functions should be preferred over the `MV_` functions when interacting with actual array types such as stored in json columns and array columns. the `MV_` functions are primarily for use with Druids older non-standard multi-value string columns, which present themselves as `VARCHAR` in the SQL layer and are sort of distinct from actual arrays, so mixing these up can sometimes result in strange type inference in the SQL layer when composing more complicated expressions. > > Anyway, the reason the query is failing in this form is because of an optimization for `ARRAY_CONTAINS` that maybe shouldn't happen when used with `JSON_QUERY_ARRAY`, or rather when the native `array_contains` expression (which backs both `ARRAY_CONTAINS` and `MV_CONTAINS` in SQL) encounters a complex type. `JSON_QUERY_ARRAY` doesn't really know the type of the element it is extracting, so it is handled as `COMPLEX<json>`, however the specialization check for primitive arrays is trying to cast the rhs argument to match the element type of the array it will be checking in the case it is a literal. > > This can be fixed by modifying https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/math/expr/Function.java#L3977 to instead be something like > > ``` > if (lhsType == null || !(lhsType.isPrimitive() || lhsType.isPrimitiveArray())) { > return this; > } > ``` > > where it fixes the issue because it bails out of the optimization path early and so falls back to the per row checking of type information to decide how to processing the function. > > @vivek807 if you'd like to make a PR with that change and add a test for this to https://github.com/apache/druid/blob/master/processing/src/test/java/org/apache/druid/query/expression/NestedDataExpressionsTest.java#L394 I would be happy to approve it, else I can make a PR with a fix later this week. We also need to do `array_overlap(json_query_array(...))` too since it has a similar optimization path and will need the same fix https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/math/expr/Function.java#L3724). Thanks, @clintropolis, for the detailed explanation. I’ve implemented the fix and updated the unit tests as well. Please review. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
