clintropolis commented on a change in pull request #6397: Adds bloom filter
aggregator to 'druid-bloom-filters' extension
URL: https://github.com/apache/incubator-druid/pull/6397#discussion_r223499886
##########
File path: docs/content/development/extensions-core/bloom-filter.md
##########
@@ -42,4 +50,53 @@ Internally, this implementation of bloom filter uses
Murmur3 fast non-cryptograp
- 1 big endian int(That is how OutputStream works) for the number of longs in
the bitset
- big endian longs in the BloomKFilter bitset
-Note: `org.apache.hive.common.util.BloomKFilter` provides a serialize method
which can be used to serialize bloom filters to outputStream.
\ No newline at end of file
+Note: `org.apache.hive.common.util.BloomKFilter` provides a serialize method
which can be used to serialize bloom filters to outputStream.
+
+## Bloom Filter Query Aggregator
+Input for a `bloomKFilter` can also be created from a druid query with the
`bloom` aggregator.
+
+### JSON Specification of Bloom Filter Aggregator
+```json
+{
+ "type": "bloomFilter",
+ "name": <output_field_name>,
+ "maxNumEntries": <maximum_number_of_elements_for_BloomKFilter>
+ "field": <dimension_spec>
+ }
+```
+
+|Property |Description |required?
|
+|-------------------------|------------------------------|----------------------------------|
+|`type` |Aggregator Type. Should always be `bloom`|yes|
+|`name` |Output field name |yes|
+|`field` |[DimensionSpec](./../dimensionspecs.html) to add to
`org.apache.hive.common.util.BloomKFilter` | yes |
+|`maxNumEntries` |Maximum number of distinct values supported by
`org.apache.hive.common.util.BloomKFilter`, default `1500`| no |
Review comment:
Hmm, digging into it: in `BloomKFilter` the [false positive rate is not
controllable](https://github.com/apache/hive/blob/master/storage-api/src/java/org/apache/hive/common/util/BloomKFilter.java#L61)
the way it is in `BloomFilter`; it is fixed at the default of 5%. I guess it
can be controlled indirectly by increasing `maxNumEntries`, though that's kind
of lame. If the actual cardinality exceeds `maxNumEntries`, the false positive
probability climbs toward 1 as more distinct values are added, and the
resulting bloom filter is useless, so that should definitely be called out in
the docs.
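
To put rough numbers on that, here is a minimal sketch using the textbook bloom filter formulas. It is not `BloomKFilter`'s actual code: the class name `BloomFppSketch` and its methods are hypothetical, and it assumes the filter is sized for `maxNumEntries` at a 5% target with the usual `m = -n ln(p) / (ln 2)^2` bits and `k = (m / n) ln 2` hash functions, ignoring any rounding or alignment the real implementation does.

```java
public final class BloomFppSketch {
    // Textbook sizing: number of bits for n expected entries at target
    // false positive probability p. Assumed, not taken from BloomKFilter.
    public static long optimalNumBits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // Textbook choice of hash function count for m bits and n expected entries.
    public static int optimalNumHashes(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    // Approximate false positive probability after `inserted` distinct values
    // have been added to a filter with m bits and k hash functions.
    public static double expectedFpp(long inserted, long m, int k) {
        return Math.pow(1.0 - Math.exp(-(double) k * inserted / m), k);
    }

    public static void main(String[] args) {
        long maxNumEntries = 1500; // the default proposed in this PR
        long m = optimalNumBits(maxNumEntries, 0.05);
        int k = optimalNumHashes(maxNumEntries, m);
        // Compare cardinalities below, at, and well above maxNumEntries.
        for (long n : new long[] {750, 1500, 3000, 15000, 150000}) {
            System.out.printf("distinct=%d approxFpp=%.4f%n", n, expectedFpp(n, m, k));
        }
    }
}
```

Under those assumptions the estimate sits near the 5% target at `maxNumEntries` and rises steeply once the cardinality goes past it, which is the behavior the docs should warn about.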