clintropolis opened a new pull request, #15521: URL: https://github.com/apache/druid/pull/15521
### Description

This PR adds `JSON_QUERY_ARRAY`, which is similar to `JSON_QUERY` but returns `ARRAY<COMPLEX<json>>` rather than `COMPLEX<json>` for any value extracted from a JSON path.

It is currently implemented purely with `ExpressionVirtualColumn` via a `DirectOperatorConversion`, rather than with the specialized `NestedFieldVirtualColumn` used by `JSON_VALUE` and `JSON_QUERY`. This is mostly because there isn't much room for optimization yet, and I would rather wait to see whether we introduce specialized array column selectors in the future than try to extend the existing selectors of this virtual column to also handle arrays of objects.

Similar to other array handling, values which are not arrays are coerced into single-element arrays, though I am open to discussion on this, since it would seem equally valid to handle them as null values...

This allows a lot of useful things, like using `UNNEST` on arrays of objects to transform an array of JSON objects into rows of JSON objects. For example, using some data sourced from a discussion in a community Slack thread, which has top-level arrays of objects (this would also work with nested arrays of objects at some path):

<img width="1184" alt="Screenshot 2023-12-08 at 12 01 39 AM" src="https://github.com/apache/druid/assets/1577461/7b1ecd93-1196-46b5-a0e5-36334856a443">

We can use `JSON_QUERY_ARRAY` to translate it into a separate row per object:

<img width="899" alt="Screenshot 2023-12-08 at 12 02 33 AM" src="https://github.com/apache/druid/assets/1577461/e6b42909-6757-43da-bfe7-2d3f6d8df39d">

and further use `JSON_VALUE` to extract values from these objects and group or aggregate on them:

<img width="772" alt="Screenshot 2023-12-08 at 12 04 08 AM" src="https://github.com/apache/druid/assets/1577461/6912aa46-1d10-4ac9-b578-aee929de2d79">

Will add docs in a follow-up PR.
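The coercion and unnest behaviour described above can be modelled in a few lines of plain Python. This is only an illustrative sketch, not Druid code: the helper names `json_query_array` and `unnest`, the `path_keys` argument, the sample rows, and the choice to return `None` for a missing path are all assumptions made for the example.

```python
from collections import Counter


def json_query_array(value, path_keys=()):
    """Hypothetical model: follow a path of object keys and always return a list.

    A value that is already a list is returned as-is; any other non-null
    value is wrapped in a single-element list, mirroring the coercion
    described in this PR. Returning None for a missing path is an
    assumption of this sketch.
    """
    current = value
    for key in path_keys:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    if current is None:
        return None
    if isinstance(current, list):
        return current
    return [current]


def unnest(rows, column, path_keys=()):
    """Hypothetical model of unnesting: yield one output row per array element."""
    for row in rows:
        for element in json_query_array(row[column], path_keys) or []:
            yield element


# Made-up sample data: one row holds an array of objects, the other a lone object.
rows = [
    {"items": [{"kind": "a", "n": 1}, {"kind": "b", "n": 2}]},
    {"items": {"kind": "a", "n": 3}},  # not an array, so it is coerced to [obj]
]

objects = list(unnest(rows, "items"))
counts = Counter(obj["kind"] for obj in objects)  # group on an extracted value
```

Here `objects` holds one entry per JSON object across both rows, and `counts` plays the role of grouping on a value pulled out of each object.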
#### Release note

Added `JSON_QUERY_ARRAY`, which is similar to `JSON_QUERY` except that the return type is always `ARRAY<COMPLEX<json>>` instead of `COMPLEX<json>`. Essentially, this function allows extracting arrays of objects from nested data and performing operations such as `UNNEST`, `ARRAY_LENGTH`, `ARRAY_SLICE`, or any other available ARRAY operations.

<hr>

This PR has:

- [ ] been self-reviewed.
- [ ] added documentation for new or modified features or behaviors.
- [x] a release note entry in the PR description.
- [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
- [x] been tested in a test Druid cluster.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]

For additional commands, e-mail: [email protected]
