[
https://issues.apache.org/jira/browse/HIVE-21616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alan Gates updated HIVE-21616:
------------------------------
Status: Patch Available (was: Open)
This patch implements "is [not] json", json_value(), and json_query().
A few notes on implementation choices.
For "is [not] json" should match the spec.
The syntax for json_value and json_query do not match the spec. The spec for
json_value is:
json_value(json_string, path_string [PASSING key AS value[, key AS value...]]
[RETURNING datatype] [ERROR | NULL | DEFAULT val ON EMPTY] [ERROR | NULL |
DEFAULT val on ERROR])
json_query's syntax is similarly wordy.
This verbosity doesn't work well with Hive's parser since it doesn't handle
keywords inside UDF declarations. So I changed the arguments to the following:
json_value(json_string, path_string [, default_value [, action_on_empty [,
action_on_error [, passing_key, passing_value [, passing_key,
passing_value...]]]]]) where
json_string is the JSON to be processed. It can be a string, char, or varchar.
path_string is the path expression to apply to the JSON. It must be constant.
It is parsed once and re-used after that. The implementation of JSON path
mostly matches what is specified in SQL:2016. There are a few exceptions noted
in the code.
default_value fulfills two purposes. First, it tells json_value what type to
return. It takes the type of the default value and, if possible, returns the
result of the path expression as that type. If the following arguments
indicate, it will also use it as a default value. If this argument is omitted
the return type is string. So json_value(jsonstr, '$.name') will return name
as a string. json_value(jsonstr, '$.age', 1) will return age as an integer.
Return types of string, char, varchar, all numeric types, boolean, list, and
struct are supported. Where casts don't make sense (e.g. an int was request
but the object returned by the path expression is a list) then a null will be
returned. The default_value can be any expression, it does not need to be
constant.
action_on_empty can be one of 'DEFAULT', 'NULL', or 'ERROR'. When the path
expression does not match anything in the json string then this value will be
returned. DEFAULT causes it to return the passed default value, NULL returns
null, and ERROR will throw an error. ERROR is not recommended, as most queries
will have at least some non-matching rows. NULL is the default.
action_on_error is the same as on empty, but applies only when an error occurs
in executing the json path expression. NULL is the default. Note that this
only applies to errors in executing the path expression, not parsing it. Parse
errors will be found up front and reported.
passing_key/passing_value allows the user to pass in dynamic values to the path
expression. passing_key must match a variable in the path expression and
json_value must be a meaningful substitute or an error will be raised and
handled per the action_on_error argument.
json_query() takes the same arguments as json_value() with the additional
restriction that the default_value must be a string type. The returned value
will be a valid json fragment.
While I have not implemented json_table, I did implement a cache for the json
strings parsed by json_value() so that a query like:
create view table_from_json as select json_value(jsonstr, '$.name') as name,
json_value(jsonstr, '$.age', 1) as age, ...
from json_source;
would only have to parse jsonstr once per row and each of the json_value
invocations could use that parsed value. This cache keeps records very briefly
(1 sec by default) to keep from bloating. The goal is for the record to last
long enough that each invocation of json_value can see it, keeping in mind that
when vectorized each json_value instance will see every record in the vector
batch before the next instance sees any.
> Implement JSON_VALUE, JSON_QUERY, and IS [NOT] JSON
> ---------------------------------------------------
>
> Key: HIVE-21616
> URL: https://issues.apache.org/jira/browse/HIVE-21616
> Project: Hive
> Issue Type: Sub-task
> Components: UDF
> Affects Versions: 4.0.0
> Reporter: Alan Gates
> Assignee: Alan Gates
> Priority: Major
> Attachments: HIVE-21615.patch
>
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)