[GitHub] clintropolis opened a new pull request #6502: bloom filter sql

GitBox Mon, 22 Oct 2018 15:17:35 -0700

clintropolis opened a new pull request #6502: bloom filter sql
URL: https://github.com/apache/incubator-druid/pull/6502
 
 
   This PR adds bloom filter support to Druid SQL queries for use in `WHERE` 
clauses. It introduces a new SQL operator, `BLOOM_FILTER_TEST`, which takes an 
expression and a base64-encoded, serialized bloom filter and constructs a 
`BloomDimFilter`. 
   
   ```sql
   SELECT COUNT(*) FROM druid.foo WHERE bloom_filter_test(dim1, '<serialized_bytes_for_BloomKFilter>')
   ```
   
   With very large bloom filters, I experienced slow planning times, which I 
believe are related to tokenization/parsing of the large SQL expression. I've 
therefore also added another built-in SQL operator, `CONTEXT_LITERAL_LOOKUP`, a 
unary operator that fetches a literal value from the `sqlLiteralLookup` 
property of the query context, so the parser only has to deal with a much 
smaller SQL expression.
   
   ```json
   {
     "query": "SELECT COUNT(*) FROM druid.foo WHERE bloom_filter_test(dim1, context_literal_lookup('x'))",
     "context": {
       "sqlLiteralLookup": {
         "x": "'<serialized_bytes_for_BloomKFilter>'"
       }
     }
   }
   ```
   After planning, `sqlLiteralLookup` is dropped from the query context to cut 
down the overall size.
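
A minimal sketch of how a client might assemble this request body, assuming Python on the client side. The helper names `build_sql_request` and `drop_literal_lookup` are hypothetical and not part of this PR; the second one only models the post-planning step described above and is not Druid's implementation.

```python
import json


def build_sql_request(table, column, serialized_filter_b64, key="x"):
    """Build a SQL request body that passes the bloom filter via the query context.

    Hypothetical helper: only the JSON shape comes from the PR description.
    """
    query = (
        f"SELECT COUNT(*) FROM {table} "
        f"WHERE bloom_filter_test({column}, context_literal_lookup('{key}'))"
    )
    return {
        "query": query,
        "context": {
            # The looked-up value is itself a SQL string literal, so it keeps
            # its surrounding single quotes, as in the JSON example above.
            "sqlLiteralLookup": {key: f"'{serialized_filter_b64}'"}
        },
    }


def drop_literal_lookup(context):
    """Model of the post-planning step: copy the context without sqlLiteralLookup."""
    return {k: v for k, v in context.items() if k != "sqlLiteralLookup"}


request = build_sql_request("druid.foo", "dim1", "<serialized_bytes_for_BloomKFilter>")
body = json.dumps(request)  # JSON text to send as the request body
```

The SQL text stays short regardless of the filter size, because the large literal travels only in the context map.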
   
   Note that this is still significantly slower than using native Druid JSON 
queries and has a much larger heap footprint due to planner overhead, but 
it's a start. I think we want to investigate further performance improvements 
to query planning. Being able to cache `DimFilter` artifacts per query would 
be a significant improvement for the bloom filter in particular, since it could 
avoid deserializing the large input into a `BloomDimFilter` multiple times, 
greatly reducing both the time spent planning and the garbage produced.
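
To illustrate the caching idea, here is a rough Python model; it is not implemented by this PR and is not Druid code. The function `deserialize_filter` and the counter `DESERIALIZE_CALLS` are hypothetical, and the decoded bytes merely stand in for a `BloomDimFilter`.

```python
import base64
from functools import lru_cache

# Counts how often the expensive decode actually runs (for illustration only).
DESERIALIZE_CALLS = 0


@lru_cache(maxsize=16)
def deserialize_filter(serialized_b64):
    """Decode a base64 payload once per distinct input string.

    Stand-in for building a filter object from the serialized bloom filter.
    """
    global DESERIALIZE_CALLS
    DESERIALIZE_CALLS += 1
    return base64.b64decode(serialized_b64)


# A placeholder payload; a real one would be a serialized bloom filter.
payload = base64.b64encode(b"\x00" * 1024).decode("ascii")

first = deserialize_filter(payload)   # decodes the payload
second = deserialize_filter(payload)  # served from the cache
```

In this model the second call returns the same object without decoding again, which is the saving in planning time and garbage that the paragraph above hopes for.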
   
   To allow extensions to define filters for SQL, `druid-sql` has been 
slightly refactored: the `SqlOperatorConversion` interface now includes a new 
method, `toDruidFilter`, which is analogous to `toDruidExpression` but 
produces a `DimFilter` instead. This allows filter expression parsing to use 
any `SqlOperatorConversion` in the operators table that produces a non-null 
filter. 
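
The dispatch described above can be modeled roughly as follows. This is a simplified Python model, not the Java code in `druid-sql`: the classes `OperatorConversion`, `LikeConversion`, and `UpperConversion`, the `OPERATOR_TABLE` dict, and the `to_filter` function are all hypothetical stand-ins, and filters are represented as plain dicts.

```python
class OperatorConversion:
    """Hypothetical stand-in for the SqlOperatorConversion interface."""

    name = None

    def to_druid_expression(self, operands):
        return None

    def to_druid_filter(self, operands):
        # Default: this operator cannot be turned into a filter.
        return None


class LikeConversion(OperatorConversion):
    """Operator that can be converted directly into a filter."""

    name = "LIKE"

    def to_druid_filter(self, operands):
        dimension, pattern = operands
        return {"type": "like", "dimension": dimension, "pattern": pattern}


class UpperConversion(OperatorConversion):
    """Operator that only produces an expression, never a filter."""

    name = "UPPER"

    def to_druid_expression(self, operands):
        return f"upper({operands[0]})"


OPERATOR_TABLE = {c.name: c for c in (LikeConversion(), UpperConversion())}


def to_filter(operator_name, operands):
    """Return a filter if the operator's conversion yields one, else None."""
    conversion = OPERATOR_TABLE.get(operator_name)
    if conversion is None:
        return None
    return conversion.to_druid_filter(operands)
```

In this model, `to_filter("LIKE", ["dim1", "a%"])` yields a filter dict, while `to_filter("UPPER", ["dim1"])` yields `None`, so a caller would fall back to other handling for that operator.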
   
   As such, `LIKE` has been refactored from a special case matched on 
`SqlKind` into `LikeOperatorConversion`, which implements `toDruidFilter`. 
`LikeOperatorConversion` and `ContextLiteralLookupOperatorConversion` are the 
only two built-in `SqlOperatorConversion` implementations that implement 
`toDruidFilter`. Additional "filters" such as `=` and comparisons are still 
handled with special-case code in the filter expression parsing logic, but I 
suspect they could be converted to implement `toDruidFilter` in a future PR.
   
   Additionally, a base SQL query test class was extracted from 
`CalciteQueryTest` to allow code reuse in the bloom filter SQL tests.

