clintropolis opened a new pull request #6502: bloom filter sql
URL: https://github.com/apache/incubator-druid/pull/6502

This PR adds bloom filter support to Druid SQL queries for use in `WHERE` clauses. It introduces a new SQL operator, `BLOOM_FILTER_TEST`, which takes an expression and a base64-encoded, serialized bloom filter and constructs a `BloomDimFilter`.

```sql
SELECT COUNT(*) FROM druid.foo WHERE bloom_filter_test(dim1, '<serialized_bytes_for_BloomKFilter>')
```

With very large bloom filters I experienced slow planning times, which I believe are related to tokenization/parsing of the large SQL expression. So I've also added another built-in SQL operator, `CONTEXT_LITERAL_LOOKUP`, a unary operator that fetches a literal value from the `sqlLiteralLookup` property of the query context, allowing the parser to deal with a much smaller SQL expression.

```json
{
  "query": "SELECT COUNT(*) FROM druid.foo WHERE bloom_filter_test(dim1, context_literal_lookup('x'))",
  "context": {
    "sqlLiteralLookup": {
      "x": "'<serialized_bytes_for_BloomKFilter>'"
    }
  }
}
```

After planning, `sqlLiteralLookup` is dropped from the query context to cut down its overall size. Note that this is still significantly slower than using native Druid JSON queries, and has a much larger heap footprint due to planner overhead, but it's a start. I think we want to investigate further performance improvements to query planning. Being able to cache `DimFilter` artifacts per query would be a significant improvement for the bloom filter in particular, since it could avoid deserializing the large input into a `BloomDimFilter` multiple times, greatly reducing both the time spent planning and the garbage produced.

To allow extensions to define filters for SQL, `druid-sql` has been slightly refactored: the `SqlOperatorConversion` interface now includes a new method, `toDruidFilter`, which is analogous to `toDruidExpression` but produces a `DimFilter` instead.
This allows filter expression parsing to use any `SqlOperatorConversion` in the operators table that produces a non-null filter. As such, `LIKE` has been refactored from a special case matched on `SqlKind` into `LikeOperatorConversion`, which implements `toDruidFilter`. `LikeOperatorConversion` and `ContextLiteralLookupOperatorConversion` are the only two built-in `SqlOperatorConversion` implementations that implement `toDruidFilter`. Additional "filters" such as `=` and the comparison operators are still handled with specialty code in the filter expression parsing logic, but I suspect they could be translated to implement `toDruidFilter` instead in a future PR.

Additionally, a base SQL query test class was extracted from `CalciteQueryTest` to allow code reuse for the bloom filter SQL tests.
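
For anyone trying the `CONTEXT_LITERAL_LOOKUP` route from a client, here is a minimal sketch of how the request body shown earlier might be assembled. The helper name `build_bloom_sql_request` and its parameters are hypothetical and not part of this PR. The sketch assumes the base64 string for the serialized `BloomKFilter` was produced elsewhere, and it only builds the JSON payload without sending it anywhere.

```python
import json


def build_bloom_sql_request(table, column, serialized_filter, key="x"):
    """Build a SQL request body that passes the bloom filter via the query context.

    Hypothetical helper for illustration only. `serialized_filter` is assumed
    to be the base64 string of an already serialized BloomKFilter.
    """
    # Base64 text never contains a single quote, so one showing up here
    # suggests the wrong value was passed in.
    if "'" in serialized_filter:
        raise ValueError("serialized filter must not contain single quotes")

    # The SQL text stays small: it only names the lookup key.
    # `table` and `column` are interpolated as-is, so pass trusted identifiers only.
    query = (
        f"SELECT COUNT(*) FROM {table} "
        f"WHERE bloom_filter_test({column}, context_literal_lookup('{key}'))"
    )

    # The large value travels in the context, wrapped in single quotes as in
    # the JSON example from the PR description.
    return {
        "query": query,
        "context": {"sqlLiteralLookup": {key: f"'{serialized_filter}'"}},
    }


payload = build_bloom_sql_request(
    "druid.foo", "dim1", "<serialized_bytes_for_BloomKFilter>"
)
print(json.dumps(payload, indent=2))
```

Keeping the filter out of the SQL string is the whole point of the operator, so the sketch interpolates only the short lookup key into the query text.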
