clintropolis opened a new pull request #11184:
URL: https://github.com/apache/druid/pull/11184
### Description
This PR adds vectorization support for the Druid native expression logical
operators, `!`, `&&`, `||`, as well as boolean functions `isnull`, `notnull`,
and `nvl`.
The `&&`, `||`, and `nvl` implementations are not quite as optimal as they
could be, since they will evaluate all arguments up front, but I will fix this
up in another branch I have as a follow-up (split from this one because it was
starting to get big). The follow-up will add vectorized conditional
expressions, and introduces a filtered vector binding that uses vector matches
to allow processing only a subset of input rows. This filtered binding will
allow slightly modifying these implementations so that `||` only needs to
evaluate the rhs rows for which the lhs is false or null, `&&` where lhs is
true or null, and `nvl` only when lhs is null.
In the process of doing this PR and writing tests, I came to the conclusion
that our logical operators behave very strangely all the time, and I think
quite wrong in SQL compatible null handling mode since null was always treated
as "false" instead of "unknown", so I've proposed changing the behavior in this
PR.
The first change is around output. Previously the logical operators would
pass values through. For Druid numeric values, any value greater than 0 is
`true`, so passing through values would result in some strange but technically
correct outputs. The new behavior homogenizes output to always be `LONG`
boolean values, e.g. `1` or `0` for boolean operations involving any types.
Previous behavior:
* `100 && 11` -> `11`
* `0.7 || 0.3` -> `0.7`
* `100 && 0` -> `0`
* `'troo' && 'true'` -> `'troo'`
* `'troo' || 'true'` -> `'true'`
New behavior:
* `100 && 11` -> `1`
* `0.7 || 0.3` -> `1`
* `100 && 0` -> `0`
* `'troo' && 'true'` -> `0`
* `'troo' || 'true'` -> `1`
etc.
The implicit conversion of `STRING`, `DOUBLE`, and `LONG` values to booleans
remains in effect:
* `LONG` or `DOUBLE` - any value greater than 0 is considered `true`, else
`false`
* `STRING` - the value `'true'` (case insensitive) is considered `true`,
everything else is `false`.
and has been documented.
The second change is that now the logical operators in SQL compatible mode
will treat `null` as "unknown" when `druid.generic.useDefaultValueForNull` is
set as `false` (SQL compatible null handling mode).
For the "or" operator:
* `true || null`, `null || true` -> `1`
* `false || null`, `null || false`, `null || null`-> `null`
For the "and" operator:
* `true && null`, `null && true`, `null && null` -> `null`
* `false && null`, `null && false` -> `0`
Since this new behavior changes query results, subtly in the case of default
mode since the results will be different (but equivalent in terms of true or
false result), and fairly significantly in SQL compatible mode since it will
now respect `null` values and treat them differently than always false, this PR
also adds a new configuration, `druid.expressions.useLegacyLogicalOperators`,
which defaults to false to use the new behavior in this PR, but allows
reverting to the existing behavior if desired. I don't really love this flag
existing, but don't see a way around it unless we are ok with changing query
results to use the new behavior only. `&&` and `||` will only be vectorized
when `druid.expressions.useLegacyLogicalOperators=false`, but if there is
demand for it, implementations could probably be added to provide the legacy
behavior as well.
benchmarks:
```
// 30: logical and operator
"SELECT CAST(long1 as BOOLEAN) AND CAST (long2 as BOOLEAN), COUNT(*)
FROM foo GROUP BY 1 ORDER BY 2",
// 31: isnull, notnull
"SELECT long5 IS NULL, long3 IS NOT NULL, count(*) FROM foo GROUP BY
1,2 ORDER BY 3"
```
```
Benchmark (query) (rowsPerSegment) (vectorize)
Mode Cnt Score Error Units
SqlExpressionBenchmark.querySql 30 5000000 false
avgt 5 761.667 ± 35.745 ms/op
SqlExpressionBenchmark.querySql 30 5000000 force
avgt 5 152.102 ± 9.390 ms/op
SqlExpressionBenchmark.querySql 31 5000000 false
avgt 5 431.104 ± 32.951 ms/op
SqlExpressionBenchmark.querySql 31 5000000 force
avgt 5 100.884 ± 8.919 ms/op
```
<hr>
This PR has:
- [x] been self-reviewed.
- [x] added documentation for new or modified features or behaviors.
- [x] added Javadocs for most classes and all non-trivial methods. Linked
related entities via Javadoc links.
- [x] added comments explaining the "why" and the intent of the code
wherever would not be obvious for an unfamiliar reader.
- [x] added unit tests or modified existing tests to cover new code paths,
ensuring the threshold for [code
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
is met.
- [x] been tested in a test Druid cluster.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]