clintropolis opened a new pull request #11184:
URL: https://github.com/apache/druid/pull/11184


   ### Description
   This PR adds vectorization support for the Druid native expression logical 
operators, `!`, `&&`, `||`, as well as boolean functions `isnull`, `notnull`, 
and `nvl`.
   
   The `&&`, `||`, and `nvl` implementations are not quite as optimal as they 
could be, since they will evaluate all arguments up front, but I will fix this 
up in another branch I have as a follow-up (split from this one because it was 
starting to get big). The follow-up will add vectorized conditional 
expressions, and introduces a filtered vector binding that uses vector matches 
to allow processing only a subset of input rows. This filtered binding will 
allow slightly modifying these implementations so that `||` only needs to 
evaluate the rhs rows for which the lhs is false or null, `&&` where lhs is 
true or null, and `nvl` only when lhs is null.
   
   In the process of doing this PR and writing tests, I came to the conclusion 
that our logical operators behave very strangely all the time, and I think 
quite wrong in SQL compatible null handling mode since null was always treated 
as "false" instead of "unknown", so I've proposed changing the behavior in this 
PR.
   
   The first change is around output. Previously the logical operators would 
pass values through. For Druid numeric values, any value greater than 0 is 
`true`, so passing through values would result in some strange but technically 
correct outputs. The new behavior homogenizes output to always be `LONG` 
boolean values, e.g. `1` or `0` for boolean operations involving any types. 
   
   Previous behavior:
   * `100 && 11` -> `11` 
   * `0.7 || 0.3` -> `0.7`
   * `100 && 0` -> `0`
   * `'troo' && 'true'` -> `'troo'`
   * `'troo' || 'true'` -> `'true'`
   
   New behavior:
   * `100 && 11` -> `1`
   * `0.7 || 0.3` -> `1`
   * `100 && 0` -> `0`
   * `'troo' && 'true'` -> `0`
   * `'troo' || 'true'` -> `1`
   
   etc.
   
   The implicit conversion of `STRING`, `DOUBLE`, and `LONG` values to booleans 
remains in effect:
   * `LONG` or `DOUBLE` - any value greater than 0 is considered `true`, else 
`false`
   * `STRING` - the value `'true'` (case insensitive) is considered `true`, 
everything else is `false`. 
   
   and has been documented.
   
   The second change is that now the logical operators in SQL compatible mode 
will treat `null` as "unknown" when `druid.generic.useDefaultValueForNull` is 
set as `false` (SQL compatible null handling mode).
   
   For the "or" operator:
   * `true || null`, `null || true` -> `1`
   * `false || null`, `null || false`, `null || null`-> `null`
   
   For the "and" operator:
   * `true && null`, `null && true`, `null && null` -> `null`
   * `false && null`, `null && false` -> `0`
   
   
   Since this new behavior changes query results, subtly in the case of default 
mode since the results will be different (but equivalent in terms of true or 
false result), and fairly significantly in SQL compatible mode since it will 
now respect `null` values and treat them differently than always false, this PR 
also adds a new configuration, `druid.expressions.useLegacyLogicalOperators`, 
which defaults to false to use the new behavior in this PR, but allows 
reverting to the existing behavior if desired. I don't really love this flag 
existing, but don't see a way around it unless we are ok with changing query 
results to use the new behavior only. `&&` and `||` will only be vectorized 
when `druid.expressions.useLegacyLogicalOperators=false`, but if there is 
demand for it, implementations could probably be added to provide the legacy 
behavior as well.
   
   benchmarks:
   ```
         // 30: logical and operator
         "SELECT CAST(long1 as BOOLEAN) AND CAST (long2 as BOOLEAN), COUNT(*) 
FROM foo GROUP BY 1 ORDER BY 2",
         // 31: isnull, notnull
         "SELECT long5 IS NULL, long3 IS NOT NULL, count(*) FROM foo GROUP BY 
1,2 ORDER BY 3"
   ```
   ```
   Benchmark                        (query)  (rowsPerSegment)  (vectorize)  
Mode  Cnt    Score    Error  Units
   SqlExpressionBenchmark.querySql       30           5000000        false  
avgt    5  761.667 ± 35.745  ms/op
   SqlExpressionBenchmark.querySql       30           5000000        force  
avgt    5  152.102 ±  9.390  ms/op
   SqlExpressionBenchmark.querySql       31           5000000        false  
avgt    5  431.104 ± 32.951  ms/op
   SqlExpressionBenchmark.querySql       31           5000000        force  
avgt    5  100.884 ±  8.919  ms/op
   ```
   <hr>
   
   
   This PR has:
   - [x] been self-reviewed.
   - [x] added documentation for new or modified features or behaviors.
   - [x] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [x] added comments explaining the "why" and the intent of the code 
wherever would not be obvious for an unfamiliar reader.
   - [x] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   - [x] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to