techdocsmith commented on code in PR #17547:
URL: https://github.com/apache/druid/pull/17547#discussion_r1876947609
##########
docs/development/extensions-core/bloom-filter.md:
##########
@@ -23,28 +23,25 @@ title: "Bloom Filter"
-->
-To use this Apache Druid extension,
[include](../../configuration/extensions.md#loading-extensions)
`druid-bloom-filter` in the extensions load list.
+To use the Apache Druid® Bloom filter extension, include
`druid-bloom-filter` in the extensions load list. See [Loading
extensions](../../configuration/extensions.md#loading-extensions) for more
information.
-This extension adds the ability to both construct bloom filters from query
results, and filter query results by testing
-against a bloom filter. A Bloom filter is a probabilistic data structure for
performing a set membership check. A bloom
-filter is a good candidate to use with Druid for cases where an explicit
filter is impossible, e.g. filtering a query
+This extension adds the ability to both construct Bloom filters from query
results, and filter query results by testing
+against a Bloom filter. A Bloom filter is a probabilistic data structure for
performing a set membership check. A Bloom
+filter is a good candidate to use with Druid for cases where an explicit
filter is impossible, such as filtering a query
Review Comment:
```suggestion
filter is a good candidate to use when an explicit filter is impossible,
such as filtering a query
```
I don't think we need "with Druid" since these are the Druid docs.
##########
docs/development/extensions-core/bloom-filter.md:
##########
@@ -23,28 +23,25 @@ title: "Bloom Filter"
-->
-To use this Apache Druid extension,
[include](../../configuration/extensions.md#loading-extensions)
`druid-bloom-filter` in the extensions load list.
+To use the Apache Druid® Bloom filter extension, include
`druid-bloom-filter` in the extensions load list. See [Loading
extensions](../../configuration/extensions.md#loading-extensions) for more
information.
-This extension adds the ability to both construct bloom filters from query
results, and filter query results by testing
-against a bloom filter. A Bloom filter is a probabilistic data structure for
performing a set membership check. A bloom
-filter is a good candidate to use with Druid for cases where an explicit
filter is impossible, e.g. filtering a query
+This extension adds the ability to both construct Bloom filters from query
results, and filter query results by testing
+against a Bloom filter. A Bloom filter is a probabilistic data structure for
performing a set membership check. A Bloom
+filter is a good candidate to use with Druid for cases where an explicit
filter is impossible, such as filtering a query
against a set of millions of values.
Following are some characteristics of Bloom filters:
-- Bloom filters are highly space efficient when compared to using a HashSet.
-- Because of the probabilistic nature of bloom filters, false positive results
are possible (element was not actually
-inserted into a bloom filter during construction, but `test()` says true)
-- False negatives are not possible (if element is present then `test()` will
never say false).
-- The false positive probability of this implementation is currently fixed at
5%, but increasing the number of entries
-that the filter can hold can decrease this false positive rate in exchange for
overall size.
-- Bloom filters are sensitive to number of elements that will be inserted in
the bloom filter. During the creation of bloom filter expected number of
entries must be specified. If the number of insertions exceed
- the specified initial number of entries then false positive probability will
increase accordingly.
+- Bloom filters are highly space efficient compared to using a HashSet.
Review Comment:
```suggestion
- Bloom filters are significantly more space efficient than HashSets.
```
##########
docs/development/extensions-core/bloom-filter.md:
##########
@@ -70,50 +68,46 @@ This string can then be used in the native or SQL Druid
query.
}
```
-|Property |Description |required? |
-|-------------------------|------------------------------|----------------------------------|
-|`type` |Filter Type. Should always be `bloom`|yes|
-|`dimension` |The dimension to filter over. | yes |
-|`bloomKFilter` |Base64 encoded Binary representation of `org.apache.hive.common.util.BloomKFilter`| yes |
-|`extractionFn`|[Extraction function](../../querying/dimensionspecs.md#extraction-functions) to apply to the dimension values |no|
-
+|Property|Description|Required|
+|--------|-----------|--------|
+|`type`|Filter type. Should always be `bloom`.|Yes|
Review Comment:
```suggestion
|`type`|Filter type. Set to `bloom`.|Yes|
```
##########
docs/development/extensions-core/bloom-filter.md:
##########
@@ -70,50 +68,46 @@ This string can then be used in the native or SQL Druid
query.
}
```
-|Property |Description |required? |
-|-------------------------|------------------------------|----------------------------------|
-|`type` |Filter Type. Should always be `bloom`|yes|
-|`dimension` |The dimension to filter over. | yes |
-|`bloomKFilter` |Base64 encoded Binary representation of `org.apache.hive.common.util.BloomKFilter`| yes |
-|`extractionFn`|[Extraction function](../../querying/dimensionspecs.md#extraction-functions) to apply to the dimension values |no|
-
+|Property|Description|Required|
+|--------|-----------|--------|
+|`type`|Filter type. Should always be `bloom`.|Yes|
+|`dimension`|Dimension to filter over.|Yes|
+|`bloomKFilter`|Base64 encoded binary representation of `org.apache.hive.common.util.BloomKFilter`.|Yes|
+|`extractionFn`|[Extraction function](../../querying/dimensionspecs.md#extraction-functions) to apply to the dimension values.|No|
-### Serialized Format for BloomKFilter
+### Serialized format for BloomKFilter
- Serialized BloomKFilter format:
+Serialized BloomKFilter format:
- - 1 byte for the number of hash functions.
- - 1 big endian int(That is how OutputStream works) for the number of longs in
the bitset
- - big endian longs in the BloomKFilter bitset
+- 1 byte for the number of hash functions.
+- 1 big-endian integer for the number of longs in the bitset.
+- Big-endian longs in the BloomKFilter bitset.
-Note: `org.apache.hive.common.util.BloomKFilter` provides a serialize method
which can be used to serialize bloom filters to outputStream.
+`org.apache.hive.common.util.BloomKFilter` provides a method to serialize
Bloom filters to `outputStream`.
-### Filtering SQL Queries
+### Filter SQL queries
-Bloom filters can be used in SQL `WHERE` clauses via the `bloom_filter_test`
operator:
+You can use Bloom filters in SQL `WHERE` clauses with the `bloom_filter_test`
operator:
```sql
SELECT COUNT(*) FROM druid.foo WHERE bloom_filter_test(<expr>,
'<serialized_bytes_for_BloomKFilter>')
```
-### Expression and Virtual Column Support
+### Expression and virtual column support
-The bloom filter extension also adds a bloom filter [Druid
expression](../../querying/math-expr.md) which shares syntax
+The Bloom filter extension also adds a Bloom filter [Druid
expression](../../querying/math-expr.md) which shares syntax
with the SQL operator.
```sql
bloom_filter_test(<expr>, '<serialized_bytes_for_BloomKFilter>')
```
-## Bloom Filter Query Aggregator
+## Bloom filter query aggregator
-Input for a `bloomKFilter` can also be created from a druid query with the
`bloom` aggregator. Note that it is very
-important to set a reasonable value for the `maxNumEntries` parameter, which
is the maximum number of distinct entries
-that the bloom filter can represent without increasing the false positive
rate. It may be worth performing a query using
-one of the unique count sketches to calculate the value for this parameter in
order to build a bloom filter appropriate
-for the query.
+You can create an input for a `BloomKFilter` from a Druid query with the
`bloom` aggregator. Make sure to set a reasonable value for the `maxNumEntries`
parameter, which is the maximum number of distinct entries that the Bloom
filter can represent without increasing the false positive rate. It may be
worth performing a query using
Review Comment:
```suggestion
You can create an input for a `BloomKFilter` from a Druid query with the
`bloom` aggregator. Make sure to set a reasonable value for the `maxNumEntries`
parameter to specify the maximum number of distinct entries that the Bloom
filter can represent without increasing the false positive rate. Try performing
a query using
```
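For reference while reviewing the serialized format bullets in this hunk (1 byte for the number of hash functions, 1 big-endian integer for the number of longs, then the big-endian longs), here is a minimal sketch of that layout using only the JDK. The class and method names are hypothetical and this is not the Hive `BloomKFilter` code; it only assumes the layout as the docs describe it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper: mirrors the documented byte layout, not the Hive implementation.
final class BloomLayoutSketch {

    // Simple holder for the decoded fields.
    static final class Decoded {
        final int numHashFunctions;
        final long[] bitSet;

        Decoded(int numHashFunctions, long[] bitSet) {
            this.numHashFunctions = numHashFunctions;
            this.bitSet = bitSet;
        }
    }

    // Writes: 1 byte (hash function count), 1 big-endian int (number of longs), then the longs.
    static byte[] write(int numHashFunctions, long[] bitSet) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes); // DataOutputStream is big-endian
            out.writeByte(numHashFunctions);
            out.writeInt(bitSet.length);
            for (long word : bitSet) {
                out.writeLong(word);
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the same three parts back in the same order.
    static Decoded read(byte[] serialized) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(serialized));
            int numHashFunctions = in.readUnsignedByte();
            int numLongs = in.readInt();
            long[] bitSet = new long[numLongs];
            for (int i = 0; i < numLongs; i++) {
                bitSet[i] = in.readLong();
            }
            return new Decoded(numHashFunctions, bitSet);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If the docs keep the format description, the serialized size follows directly from it: 1 + 4 + 8 × (number of longs) bytes.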
##########
docs/development/extensions-core/bloom-filter.md:
##########
@@ -56,11 +53,12 @@ BloomKFilter.serialize(byteArrayOutputStream, bloomFilter);
String base64Serialized =
Base64.encodeBase64String(byteArrayOutputStream.toByteArray());
```
-This string can then be used in the native or SQL Druid query.
+You can then use this string in the native or SQL Druid query.
Review Comment:
```suggestion
You can then use the Base64 encoded string in JSON-based or SQL-based
queries in Druid.
```
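To make "use the Base64 encoded string" concrete, a short sketch of placing the serialized bytes into a JSON filter spec might help readers. It assumes only the `type`, `dimension`, and `bloomKFilter` properties from the filter table in this PR; the class name, method name, and the `user` dimension are hypothetical, and it uses the JDK's `java.util.Base64` rather than the commons-codec call shown in the doc example.

```java
import java.util.Base64;

// Hypothetical helper: builds the JSON filter object around an already serialized filter.
final class BloomFilterJsonSketch {

    // Assumes the dimension name needs no JSON escaping; real code should use a JSON library.
    static String toFilterJson(String dimension, byte[] serializedFilter) {
        String encoded = Base64.getEncoder().encodeToString(serializedFilter);
        return "{\"type\":\"bloom\",\"dimension\":\"" + dimension
            + "\",\"bloomKFilter\":\"" + encoded + "\"}";
    }
}
```

For example, `BloomFilterJsonSketch.toFilterJson("user", bytes)` yields an object whose `bloomKFilter` value is the same string you would pass as the second argument to `bloom_filter_test` in SQL.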
##########
docs/development/extensions-core/bloom-filter.md:
##########
@@ -23,28 +23,25 @@ title: "Bloom Filter"
-->
-To use this Apache Druid extension,
[include](../../configuration/extensions.md#loading-extensions)
`druid-bloom-filter` in the extensions load list.
+To use the Apache Druid® Bloom filter extension, include
`druid-bloom-filter` in the extensions load list. See [Loading
extensions](../../configuration/extensions.md#loading-extensions) for more
information.
-This extension adds the ability to both construct bloom filters from query
results, and filter query results by testing
-against a bloom filter. A Bloom filter is a probabilistic data structure for
performing a set membership check. A bloom
-filter is a good candidate to use with Druid for cases where an explicit
filter is impossible, e.g. filtering a query
+This extension adds the ability to both construct Bloom filters from query
results, and filter query results by testing
+against a Bloom filter. A Bloom filter is a probabilistic data structure for
performing a set membership check. A Bloom
Review Comment:
```suggestion
against a Bloom filter. A Bloom filter is a probabilistic data structure to
check for set membership. A Bloom
```
##########
docs/development/extensions-core/bloom-filter.md:
##########
@@ -124,15 +118,17 @@ for the query.
}
```
-|Property |Description |required? |
-|-------------------------|------------------------------|----------------------------------|
-|`type` |Aggregator Type. Should always be `bloom`|yes|
-|`name` |Output field name |yes|
-|`field` |[DimensionSpec](../../querying/dimensionspecs.md) to add to `org.apache.hive.common.util.BloomKFilter` | yes |
-|`maxNumEntries` |Maximum number of distinct values supported by `org.apache.hive.common.util.BloomKFilter`, default `1500`| no |
+|Property|Description|Required|
+|--------|-----------|--------|
+|`type`|Aggregator type. Should always be `bloom`.|Yes|
Review Comment:
```suggestion
|`type`|Aggregator type. Set to `bloom`.|Yes|
```
##########
docs/development/extensions-core/bloom-filter.md:
##########
@@ -23,28 +23,25 @@ title: "Bloom Filter"
-->
-To use this Apache Druid extension,
[include](../../configuration/extensions.md#loading-extensions)
`druid-bloom-filter` in the extensions load list.
+To use the Apache Druid® Bloom filter extension, include
`druid-bloom-filter` in the extensions load list. See [Loading
extensions](../../configuration/extensions.md#loading-extensions) for more
information.
-This extension adds the ability to both construct bloom filters from query
results, and filter query results by testing
-against a bloom filter. A Bloom filter is a probabilistic data structure for
performing a set membership check. A bloom
-filter is a good candidate to use with Druid for cases where an explicit
filter is impossible, e.g. filtering a query
+This extension adds the ability to both construct Bloom filters from query
results, and filter query results by testing
+against a Bloom filter. A Bloom filter is a probabilistic data structure for
performing a set membership check. A Bloom
+filter is a good candidate to use with Druid for cases where an explicit
filter is impossible, such as filtering a query
against a set of millions of values.
Following are some characteristics of Bloom filters:
-- Bloom filters are highly space efficient when compared to using a HashSet.
-- Because of the probabilistic nature of bloom filters, false positive results
are possible (element was not actually
-inserted into a bloom filter during construction, but `test()` says true)
-- False negatives are not possible (if element is present then `test()` will
never say false).
-- The false positive probability of this implementation is currently fixed at
5%, but increasing the number of entries
-that the filter can hold can decrease this false positive rate in exchange for
overall size.
-- Bloom filters are sensitive to number of elements that will be inserted in
the bloom filter. During the creation of bloom filter expected number of
entries must be specified. If the number of insertions exceed
- the specified initial number of entries then false positive probability will
increase accordingly.
+- Bloom filters are highly space efficient compared to using a HashSet.
+- Because of the probabilistic nature of Bloom filters, false positive results
are possible. For example, the `test()` function might return `true` for an
element that wasn't inserted into the filter.
+- False negatives are not possible. If an element is present, `test()` always
returns `true`.
+- The false positive probability of this implementation is fixed at 5%.
Increasing the number of entries that the filter can hold can decrease this
false positive rate in exchange for overall size.
+- Bloom filters are sensitive to the number of inserted elements. You must
specify the expected number of entries when you create the Bloom filter. If the
number of insertions exceeds the specified number of entries, the false
positive probability increases accordingly.
-This extension is currently based on
`org.apache.hive.common.util.BloomKFilter` from `hive-storage-api`. Internally,
+This extension is based on `org.apache.hive.common.util.BloomKFilter` from
`hive-storage-api`. Internally,
this implementation uses Murmur3 as the hash algorithm.
-To construct a BloomKFilter externally with Java to use as a filter in a Druid
query:
+The following example shows how to construct a BloomKFilter externally with
Java to use as a filter in a Druid query:
Review Comment:
```suggestion
The following Java example shows how to construct a BloomKFilter externally:
```
##########
docs/development/extensions-core/bloom-filter.md:
##########
@@ -154,25 +150,25 @@ for the query.
}
```
-response
+Example response:
```json
-[{"timestamp":"2015-09-12T00:00:00.000Z","result":{"userBloom":"BAAAJhAAAA..."}}]
+[
+ {
+ "timestamp":"2015-09-12T00:00:00.000Z",
+ "result":{"userBloom":"BAAAJhAAAA..."}
+ }
+]
```
-These values can then be set in the filter specification described above.
-
-Ordering results by a bloom filter aggregator, for example in a TopN query,
will perform a comparatively expensive
-linear scan _of the filter itself_ to count the number of set bits as a means
of approximating how many items have been
-added to the set. As such, ordering by an alternate aggregation is recommended
if possible.
+Ordering results by a Bloom filter aggregator, for example in a TopN query,
can be resource-intensive. This is because the operation performs an expensive
linear scan of the filter to approximate the count of items added to the set by
counting the number of set bits. We recommend ordering by an alternative
aggregation method.
Review Comment:
```suggestion
We recommend ordering by an alternative aggregation method instead of
ordering results by a Bloom filter aggregator.
Ordering results by a Bloom filter aggregator can be resource-intensive
because Druid performs an expensive linear scan of the filter to approximate
the count of items added to the set by counting the number of set bits.
```
I suggest putting the recommendation first, then the explanation.
##########
docs/development/extensions-core/bloom-filter.md:
##########
@@ -23,28 +23,25 @@ title: "Bloom Filter"
-->
-To use this Apache Druid extension,
[include](../../configuration/extensions.md#loading-extensions)
`druid-bloom-filter` in the extensions load list.
+To use the Apache Druid® Bloom filter extension, include
`druid-bloom-filter` in the extensions load list. See [Loading
extensions](../../configuration/extensions.md#loading-extensions) for more
information.
-This extension adds the ability to both construct bloom filters from query
results, and filter query results by testing
-against a bloom filter. A Bloom filter is a probabilistic data structure for
performing a set membership check. A bloom
-filter is a good candidate to use with Druid for cases where an explicit
filter is impossible, e.g. filtering a query
+This extension adds the ability to both construct Bloom filters from query
results, and filter query results by testing
Review Comment:
```suggestion
This extension adds the ability to construct Bloom filters from query
results and to filter query results by testing
```
##########
docs/development/extensions-core/bloom-filter.md:
##########
@@ -23,28 +23,25 @@ title: "Bloom Filter"
-->
-To use this Apache Druid extension,
[include](../../configuration/extensions.md#loading-extensions)
`druid-bloom-filter` in the extensions load list.
+To use the Apache Druid® Bloom filter extension, include
`druid-bloom-filter` in the extensions load list. See [Loading
extensions](../../configuration/extensions.md#loading-extensions) for more
information.
-This extension adds the ability to both construct bloom filters from query
results, and filter query results by testing
-against a bloom filter. A Bloom filter is a probabilistic data structure for
performing a set membership check. A bloom
-filter is a good candidate to use with Druid for cases where an explicit
filter is impossible, e.g. filtering a query
+This extension adds the ability to both construct Bloom filters from query
results, and filter query results by testing
+against a Bloom filter. A Bloom filter is a probabilistic data structure for
performing a set membership check. A Bloom
+filter is a good candidate to use with Druid for cases where an explicit
filter is impossible, such as filtering a query
against a set of millions of values.
Following are some characteristics of Bloom filters:
-- Bloom filters are highly space efficient when compared to using a HashSet.
-- Because of the probabilistic nature of bloom filters, false positive results
are possible (element was not actually
-inserted into a bloom filter during construction, but `test()` says true)
-- False negatives are not possible (if element is present then `test()` will
never say false).
-- The false positive probability of this implementation is currently fixed at
5%, but increasing the number of entries
-that the filter can hold can decrease this false positive rate in exchange for
overall size.
-- Bloom filters are sensitive to number of elements that will be inserted in
the bloom filter. During the creation of bloom filter expected number of
entries must be specified. If the number of insertions exceed
- the specified initial number of entries then false positive probability will
increase accordingly.
+- Bloom filters are highly space efficient compared to using a HashSet.
+- Because of the probabilistic nature of Bloom filters, false positive results
are possible. For example, the `test()` function might return `true` for an
element that wasn't inserted into the filter.
+- False negatives are not possible. If an element is present, `test()` always
returns `true`.
+- The false positive probability of this implementation is fixed at 5%.
Increasing the number of entries that the filter can hold can decrease this
false positive rate in exchange for overall size.
+- Bloom filters are sensitive to the number of inserted elements. You must
specify the expected number of entries when you create the Bloom filter. If the
number of insertions exceeds the specified number of entries, the false
positive probability increases accordingly.
Review Comment:
```suggestion
- Bloom filters are sensitive to the number of inserted elements. You must
specify the expected number of entries at creation time. If the number of
insertions exceeds the specified number of entries, the false positive
probability increases accordingly.
```
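If it helps to justify "increases accordingly", the effect can be illustrated with the textbook Bloom filter formulas. This is a sketch under assumptions: it sizes a plain Bloom filter for a 5% target and the default `maxNumEntries` of `1500` mentioned elsewhere in this PR, and it does not claim to match how `BloomKFilter` sizes or probes its bit set. All names are hypothetical.

```java
// Hypothetical helper: textbook Bloom filter math, not the BloomKFilter implementation.
final class BloomFppSketch {

    // Bits needed for n expected entries at target false positive probability p:
    // m = -n * ln(p) / (ln 2)^2
    static long optimalNumBits(long expectedEntries, double fpp) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-expectedEntries * Math.log(fpp) / (ln2 * ln2));
    }

    // Hash function count that minimizes false positives: k = (m / n) * ln 2
    static int optimalNumHashFunctions(long expectedEntries, long numBits) {
        return Math.max(1, (int) Math.round((double) numBits / expectedEntries * Math.log(2)));
    }

    // Approximate false positive probability after n insertions: (1 - e^(-k*n/m))^k
    static double estimatedFpp(long numBits, int numHashFunctions, long inserted) {
        double fractionOfBitsSet = 1 - Math.exp(-(double) numHashFunctions * inserted / numBits);
        return Math.pow(fractionOfBitsSet, numHashFunctions);
    }
}
```

With a filter sized for 1500 entries at 5%, `estimatedFpp` stays close to the target at 1500 insertions and climbs well above it at 3000, which is the behavior the bullet describes.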
##########
docs/development/extensions-core/bloom-filter.md:
##########
@@ -23,28 +23,25 @@ title: "Bloom Filter"
-->
-To use this Apache Druid extension,
[include](../../configuration/extensions.md#loading-extensions)
`druid-bloom-filter` in the extensions load list.
+To use the Apache Druid® Bloom filter extension, include
`druid-bloom-filter` in the extensions load list. See [Loading
extensions](../../configuration/extensions.md#loading-extensions) for more
information.
-This extension adds the ability to both construct bloom filters from query
results, and filter query results by testing
-against a bloom filter. A Bloom filter is a probabilistic data structure for
performing a set membership check. A bloom
-filter is a good candidate to use with Druid for cases where an explicit
filter is impossible, e.g. filtering a query
+This extension adds the ability to both construct Bloom filters from query
results, and filter query results by testing
+against a Bloom filter. A Bloom filter is a probabilistic data structure for
performing a set membership check. A Bloom
+filter is a good candidate to use with Druid for cases where an explicit
filter is impossible, such as filtering a query
against a set of millions of values.
Following are some characteristics of Bloom filters:
-- Bloom filters are highly space efficient when compared to using a HashSet.
-- Because of the probabilistic nature of bloom filters, false positive results
are possible (element was not actually
-inserted into a bloom filter during construction, but `test()` says true)
-- False negatives are not possible (if element is present then `test()` will
never say false).
-- The false positive probability of this implementation is currently fixed at
5%, but increasing the number of entries
-that the filter can hold can decrease this false positive rate in exchange for
overall size.
-- Bloom filters are sensitive to number of elements that will be inserted in
the bloom filter. During the creation of bloom filter expected number of
entries must be specified. If the number of insertions exceed
- the specified initial number of entries then false positive probability will
increase accordingly.
+- Bloom filters are highly space efficient compared to using a HashSet.
+- Because of the probabilistic nature of Bloom filters, false positive results
are possible. For example, the `test()` function might return `true` for an
element that wasn't inserted into the filter.
Review Comment:
```suggestion
- Because Bloom filters are probabilistic, false positive results are
possible. For example, the `test()` function might return `true` for an
element that is not in the filter.
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]