Hi Jian,

Thanks for this proposal! I think it's a valuable feature for Impala. Given
that we don't support the json_tuple() UDTF like Hive does, we have little
flexibility in processing json string columns: we just have a
get_json_object() function in this area. Although the above example can be
implemented as

 - SELECT COUNT(*) FROM events WHERE get_json_object(products_detail,
"$.items[2]") IS NOT NULL AND event_name = 'SubmitOrder'

I believe there are a lot more examples we can't workaround. Feel free to
file a JIRA for adding these builtin functions. Design docs will be welcome!

One thing we should take care of is the performance. If we need to compile
the CEL expr for each row, it's hard to leverage the codegen functionality
of Impala. In reality, the json string columns usually have
flexible results. it's possible that in some rows "items" is a string and
in other rows "items" is a list. Then we can only interpret "items.size()"
based on its type, which should be done after parsing the json string.

To overcome this, we may need another argument to explicitly declare the
json schema. I.e.

 - bool cel_json_eval_bool(string input_json, string json_schema, string
cel_expression):

Looking forward to ideas from other guys.

Thanks,
Quanlong

On Mon, Jun 6, 2022 at 7:58 PM Jian Zhang <zjsar...@gmail.com> wrote:

> Hi community,
>
> Some of our users store JSON/FlatBuffer data in Impala tables. They wish to
> perform flexible queries on these data. Common Expression Language (CEL) is
> a simple expression language built on top of protocol buffer types. It
> would be nice if there are some built-in functions to support query
> JSON/FlatBuffer objects with CEL. For example:
>
>    - bigint cel_json_eval_int(string input_json, string cel_expression):
>    queries the input_json with cel_expression, returns a bigint type.
>    - bigint cel_json_eval_bool(string input_json, string cel_expression):
>    queries the input_json with cel_expression, returns a boolean type.
>    - string cel_json_eval_string(string input_json, string cel_expression):
>    queries the input_json with cel_expression, returns a string type.
>    - ...
>
>
> With the support of these functions and CEL evaluation, users can write the
> following query to:
>
>    - SELECT COUNT(*) FROM events WHERE cel_json_eval_bool(products_detail,
>    "items.size()>=3") AND event_name = 'SubmitOrder'
>
>
> The cel_expression can be any value as long as it follows the CEL language
> definition (https://github.com/google/cel-spec/blob/master/doc/langdef.md
> ).
> As a reference, google has a CEL implementation here:
> https://github.com/google/cel-cpp
>
> Best regards,
> Jian Zhang.
>

Reply via email to