clintropolis opened a new pull request #11853:
URL: https://github.com/apache/druid/pull/11853
### Description
This PR adds support for Druid "complex" types to the native expression
processing system, made possible after the type system enhancements done in
#11713. As a result, _all_ Druid data can now be used within expressions,
provided that expressions are added to handle these types.
`ObjectBinding`, the non-vectorized expression input data provider, now
implements `ColumnInspector` so that it can retain type information when
available. A new constant, `ComplexExpr`, has been added, which accepts the
`ExpressionType` alongside the value to represent the complex values provided
by the binding.
Several generic nullable value binary serde methods for types have been
moved out of `ExprEval` and into `Types`, so that they are more generally
available for writing nullable values that follow the `| null (byte) | value
(byte[]) |` pattern, which now covers all of the `ExprEval` types. I've
adjusted the binary formats slightly to be more consistent, so there are some
minor changes to the expression buffer aggregator, but this should not cause
any compatibility issues because this format is not written to segments
anywhere and is confined to processing within a single node.
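To illustrate the `| null (byte) | value (byte[]) |` layout, here is a minimal sketch. The class `NullableSerdeSketch`, its method names, the marker byte values, and the length prefix used for variable-width values are all assumptions made for illustration; they are not the actual methods added to `Types`.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper showing the | null (byte) | value (byte[]) | layout; not Druid's real API. */
final class NullableSerdeSketch
{
  // Assumed marker values for illustration only.
  static final byte IS_NULL = 1;
  static final byte NOT_NULL = 0;

  /** Writes a nullable long at an absolute position and returns the number of bytes written. */
  static int writeNullableLong(ByteBuffer buf, int position, Long value)
  {
    if (value == null) {
      buf.put(position, IS_NULL);
      return 1;
    }
    buf.put(position, NOT_NULL);
    buf.putLong(position + 1, value);
    return 1 + Long.BYTES;
  }

  /** Reads a nullable long previously written by writeNullableLong. */
  static Long readNullableLong(ByteBuffer buf, int position)
  {
    if (buf.get(position) == IS_NULL) {
      return null;
    }
    return buf.getLong(position + 1);
  }

  /** Variable-width variant: null byte, then an int length (an assumption), then the bytes. */
  static int writeNullableBytes(ByteBuffer buf, int position, byte[] value)
  {
    if (value == null) {
      buf.put(position, IS_NULL);
      return 1;
    }
    buf.put(position, NOT_NULL);
    buf.putInt(position + 1, value.length);
    for (int i = 0; i < value.length; i++) {
      buf.put(position + 1 + Integer.BYTES + i, value[i]);
    }
    return 1 + Integer.BYTES + value.length;
  }

  /** Reads a nullable byte array previously written by writeNullableBytes. */
  static byte[] readNullableBytes(ByteBuffer buf, int position)
  {
    if (buf.get(position) == IS_NULL) {
      return null;
    }
    int length = buf.getInt(position + 1);
    byte[] out = new byte[length];
    for (int i = 0; i < length; i++) {
      out[i] = buf.get(position + 1 + Integer.BYTES + i);
    }
    return out;
  }
}
```

Because every value starts with the same one-byte null marker, a reader can check for null without knowing anything else about the type, which is what makes the pattern reusable across all of the expression types.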
A base interface has been extracted from `ObjectStrategy` in
`druid-processing` to provide conversion between object and binary format for
complex types. It is called `ObjectByteStrategy` (because naming is hard) and
lives in `druid-core`. A registry mapping type names to these
`ObjectByteStrategy` instances has been added, and registering a
`ComplexMetricsSerde` in `ComplexMetrics` will automatically register its
`ObjectStrategy` in the lower level `ObjectByteStrategy` registry. This would
be less messy if `druid-core` and `druid-processing` were merged, since the
`ComplexMetrics` registry could then be used directly for binary serialization
of expressions, but they are not merged yet.
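The two-level registration described above can be sketched roughly as follows. Every name here (`ByteStrategySketch`, `LowerLevelRegistrySketch`, `HigherLevelRegistrySketch`, `Utf8StrategySketch`) is hypothetical and only loosely modeled on the description; the real `ObjectByteStrategy` interface and registries in this PR may look different.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the base interface: object <-> binary conversion only. */
interface ByteStrategySketch<T>
{
  T fromByteBuffer(ByteBuffer buffer, int numBytes);

  byte[] toBytes(T val);
}

/** Hypothetical lower level registry (the "core" side), keyed by complex type name. */
final class LowerLevelRegistrySketch
{
  private static final ConcurrentHashMap<String, ByteStrategySketch<?>> STRATEGIES =
      new ConcurrentHashMap<>();

  static void register(String typeName, ByteStrategySketch<?> strategy)
  {
    ByteStrategySketch<?> existing = STRATEGIES.putIfAbsent(typeName, strategy);
    if (existing != null && existing != strategy) {
      throw new IllegalStateException("Type [" + typeName + "] is already registered");
    }
  }

  static ByteStrategySketch<?> get(String typeName)
  {
    return STRATEGIES.get(typeName);
  }
}

/** Hypothetical higher level registry (the "processing" side) that forwards to the lower one. */
final class HigherLevelRegistrySketch
{
  private static final ConcurrentHashMap<String, ByteStrategySketch<?>> SERDES =
      new ConcurrentHashMap<>();

  static void registerSerde(String typeName, ByteStrategySketch<?> strategy)
  {
    SERDES.putIfAbsent(typeName, strategy);
    // Registering at the higher level automatically registers at the lower level too,
    // so code that only sees the lower level can still serialize the type.
    LowerLevelRegistrySketch.register(typeName, strategy);
  }
}

/** Example strategy used only to exercise the sketch. */
final class Utf8StrategySketch implements ByteStrategySketch<String>
{
  @Override
  public String fromByteBuffer(ByteBuffer buffer, int numBytes)
  {
    byte[] bytes = new byte[numBytes];
    buffer.get(bytes); // reads from the buffer's current position
    return new String(bytes, StandardCharsets.UTF_8);
  }

  @Override
  public byte[] toBytes(String val)
  {
    return val.getBytes(StandardCharsets.UTF_8);
  }
}
```

The forwarding call in `registerSerde` is the part that mirrors the description: callers keep registering in one place, and the lower level lookup by type name works without depending on the higher level module.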
To showcase the new complex expressions, I have added 3 new bloom filter
expressions to the `druid-bloom-filter` extension:
* `bloom_filter(expr)` - creates a bloom filter with expected capacity `expr`
* `bloom_filter_test(expr1, expr2)` - checks if `expr2` is contained in the
bloom filter `expr1`
* `bloom_filter_add(expr1, expr2)` - adds `expr2` to bloom filter `expr1`.
I have not documented these yet because I'm still considering how to
position them. Several other parts of the expression system, such as the
native expression aggregator, are still missing documentation for the same
reason. I have also not wired these up to SQL functions yet, for similar
reasons.
With these expressions it is even possible, for example, to re-create the
native bloom filter aggregator using the expression aggregator instead:
```
{
  "type": "expression",
  "name": "bloom_expression",
  "fields": ["user"],
  "initialValue": "bloom_filter(10000)",
  "fold": "bloom_filter_add(user, __acc)",
  "maxSizeBytes": 8096
}
```
but I think this is just scratching the surface of what this change will
make possible.
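As a rough mental model of what the aggregator spec above does, the sketch below starts an accumulator from the initial value and applies the fold step once per input row. `ToyBloomFilter` and `ExpressionFoldSketch` are hypothetical stand-ins written for this illustration, with a deliberately simple hashing scheme; they are not the `druid-bloom-filter` implementation or the expression aggregator itself.

```java
import java.util.BitSet;
import java.util.List;
import java.util.Objects;

/** Toy bloom filter for illustration only; sizing and hashing are simplistic assumptions. */
final class ToyBloomFilter
{
  private static final int NUM_HASHES = 3;
  private final int numBits;
  private final BitSet bits;

  ToyBloomFilter(int expectedEntries)
  {
    this.numBits = Math.max(64, expectedEntries * 10);
    this.bits = new BitSet(numBits);
  }

  /** Sets every bit position derived from the value. */
  void add(Object value)
  {
    for (int i = 0; i < NUM_HASHES; i++) {
      bits.set(index(value, i));
    }
  }

  /** Returns false if the value was definitely never added, true if it might have been. */
  boolean test(Object value)
  {
    for (int i = 0; i < NUM_HASHES; i++) {
      if (!bits.get(index(value, i))) {
        return false;
      }
    }
    return true;
  }

  // Double hashing: derive the i-th bit position from two values based on hashCode().
  private int index(Object value, int i)
  {
    int h1 = Objects.hashCode(value);
    int h2 = (h1 >>> 16) | 1;
    return Math.floorMod(h1 + i * h2, numBits);
  }
}

/** Mimics "initialValue" followed by "fold" applied once per row. */
final class ExpressionFoldSketch
{
  static ToyBloomFilter fold(int capacity, List<String> userColumn)
  {
    ToyBloomFilter acc = new ToyBloomFilter(capacity); // plays the role of "initialValue"
    for (String user : userColumn) {
      acc.add(user);                                   // plays the role of "fold" with __acc
    }
    return acc;
  }
}
```

A bloom filter never reports a false negative, so every value folded in is guaranteed to pass `test`; values that were never added usually fail it, but can occasionally pass as false positives.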
<img width="1413" alt="Screen Shot 2021-09-25 at 4 52 31 PM"
src="https://user-images.githubusercontent.com/1577461/139329494-ba3f3f28-06f2-498f-b42a-1b32054ec622.png">
### Future work
Implementing additional expressions for other complex type extensions, such
as data sketches.
<hr>
##### Key changed/added classes in this PR
* `ExprEval`
* `Types`
* `ObjectStrategy`
<hr>
This PR has:
- [x] been self-reviewed.
- [ ] added documentation for new or modified features or behaviors.
- [x] added Javadocs for most classes and all non-trivial methods. Linked
related entities via Javadoc links.
- [x] added comments explaining the "why" and the intent of the code
wherever would not be obvious for an unfamiliar reader.
- [x] added unit tests or modified existing tests to cover new code paths,
ensuring the threshold for [code
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
is met.
- [ ] added integration tests.
- [x] been tested in a test Druid cluster.