clintropolis commented on issue #12695: URL: https://github.com/apache/druid/issues/12695#issuecomment-1177133110
Hi @FrankChen021 thanks for having a look, sorry I haven't finished filling out the PR description for #12753 yet, it might have been able to answer some of your questions. >When using json_value function to extract a value, this function does not know what the actual type of the value in advance and maybe there might be some problems. For this, I have actually picked up a syntax that was implemented in the Calcite parser for these functions that allows specifying the type inline, looking something like ```JSON_VALUE(x, '$.some.nested' RETURNING BIGINT)``` The way I have implemented stuff during SQL conversion will also "eat" cast operations, so wrapping `JSON_VALUE` in a `CAST` will plan to the same native virtual column as the 'returning' syntax would, decorated with the expected type. I don't know how standard this part of the syntax is, but it came free from using these functions in Calcite. Underlying `JSON_VALUE` is a thing called the [`NestedFieldVirtualColumn`](https://github.com/apache/druid/pull/12753/files#diff-2aee7245b26958e74829b61d97461262e1a92b2ff7b1a633028eef96b8c1ef62R62) which can make optimized type selectors as asked for by the expected type, as well as supply the column indexes and everything else which provides 'native' druid performance. >I don't know if json_value and json_query are the final SQL function names in your implementation, but I think they're a little confusion for people when they first touch these functions. json_extract is a more widely used function name in other database like MySQL, SQLite, ClickHouse etc, I think we can use it so people are easy to know what these functions do. I did some surveying as well and `JSON_VALUE` and `JSON_QUERY` are also relatively widely used, in fact it looks like even Clickhouse supports them (also big query, ms sql, oracle, etc). That said, it is relatively easy to add additional functions at the SQL layer, including optimized functions that plan into the `NestedFieldVirtualColumn`, you can see 2 of them still hanging out in my PR, `GET_PATH` and `JSON_GET_PATH` were my initial prototype functions and use 'jq' style paths instead of 'JSONPath', [see `NestedDataOperatorConversions`](https://github.com/apache/druid/pull/12753/files#diff-266bdd0d604d7441493d9b07a91dd663acd39472162b3c2e6384ca63b36c3a9dR145) >The final one is about the performance of application of JSONPath in above functions. There is a PR(https://github.com/apache/druid/pull/11467) compares the JSONPath-based implementation and a hand-writing implementation which turns out the latter one has much better performance. I'm not saying this hand-writing implementation should be used but just want to give you a hint. Heh, not to worry, I actually also wrote my own JSONPath parser (also jq parser) that only handles the subset of functionality that we actually support for these columns [see `NestedPathFinder`](https://github.com/apache/druid/pull/12753/files#diff-58d96677255bd0fd215bdf5d7d519181c48df622a30d957c0124acfd399bd672R67). I actually missed #11467, though I guess there is a bit of difference between what's going on in that PR, which looks like it has functions to handle stringified json inputs, vs what I'm proposing here, which is a new column type that decomposes the JSON at ingest time into indexed nested columns, but I'll have a look to see if there is any overlap. Thanks for pointing this out. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
