(druid) branch master updated: [Docs] Improve Bloom filter topic (#17547)

techdocsmith Tue, 10 Dec 2024 11:46:00 -0800

This is an automated email from the ASF dual-hosted git repository.

techdocsmith pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/druid.git



The following commit(s) were added to refs/heads/master by this push:
     new a51061fa434 [Docs] Improve Bloom filter topic (#17547)
a51061fa434 is described below

commit a51061fa4347ef7dc0eb58f05d4bc308e41b1b50
Author: Katya Macedo <38017980+ektra...@users.noreply.github.com>
AuthorDate: Tue Dec 10 13:43:56 2024 -0600

    [Docs] Improve Bloom filter topic (#17547)
    
    * [Docs] Improve Bloom filter topic
    
    * Apply suggestions from code review
    
    Co-authored-by: Charles Smith <techdocsm...@gmail.com>
    
    * Update spelling file
    
    ---------
    
    Co-authored-by: Charles Smith <techdocsm...@gmail.com>
---
 docs/development/extensions-core/bloom-filter.md | 113 +++++++++++------------
 website/.spelling                                |   1 +
 2 files changed, 56 insertions(+), 58 deletions(-)

diff --git a/docs/development/extensions-core/bloom-filter.md b/docs/development/extensions-core/bloom-filter.md
index 30cebeef6c8..c0167e446da 100644
--- a/docs/development/extensions-core/bloom-filter.md
+++ b/docs/development/extensions-core/bloom-filter.md
@@ -23,28 +23,25 @@ title: "Bloom Filter"
   -->
 
 
-To use this Apache Druid extension, [include](../../configuration/extensions.md#loading-extensions) `druid-bloom-filter` in the extensions load list.
+To use the Apache Druid&circledR; Bloom filter extension, include `druid-bloom-filter` in the extensions load list. See [Loading extensions](../../configuration/extensions.md#loading-extensions) for more information.
 
-This extension adds the ability to both construct bloom filters from query results, and filter query results by testing
-against a bloom filter. A Bloom filter is a probabilistic data structure for performing a set membership check. A bloom
-filter is a good candidate to use with Druid for cases where an explicit filter is impossible, e.g. filtering a query
+This extension adds the abilities to construct Bloom filters from query results and to filter query results by testing
+against a Bloom filter. A Bloom filter is a probabilistic data structure to check for set membership. A Bloom
+filter is a good candidate to use when an explicit filter is impossible, such as filtering a query
 against a set of millions of values.
 
 Following are some characteristics of Bloom filters:
 
-- Bloom filters are highly space efficient when compared to using a HashSet.
-- Because of the probabilistic nature of bloom filters, false positive results are possible (element was not actually
-inserted into a bloom filter during construction, but `test()` says true)
-- False negatives are not possible (if element is present then `test()` will never say false).
-- The false positive probability of this implementation is currently fixed at 5%, but increasing the number of entries
-that the filter can hold can decrease this false positive rate in exchange for overall size.
-- Bloom filters are sensitive to number of elements that will be inserted in the bloom filter. During the creation of bloom filter expected number of entries must be specified. If the number of insertions exceed
- the specified initial number of entries then false positive probability will increase accordingly.
+- Bloom filters are significantly more space efficient than HashSets.
+- Because they are probabilistic, false positive results are possible with Bloom filters. For example, the `test()` function might return `true` for an element that is not within the filter.
+- False negatives are not possible. If an element is present, `test()` always returns `true`.
+- The false positive probability of this implementation is fixed at 5%. Increasing the number of entries that the filter can hold can decrease this false positive rate in exchange for overall size.
+- Bloom filters are sensitive to the number of inserted elements. You must specify the expected number of entries at creation time. If the number of insertions exceeds the specified number of entries, the false positive probability increases accordingly.
 
-This extension is currently based on `org.apache.hive.common.util.BloomKFilter` from `hive-storage-api`. Internally,
+This extension is based on `org.apache.hive.common.util.BloomKFilter` from `hive-storage-api`. Internally,
 this implementation uses Murmur3 as the hash algorithm.
 
-To construct a BloomKFilter externally with Java to use as a filter in a Druid query:
+The following Java example shows how to construct a BloomKFilter externally:
 
 ```java
 BloomKFilter bloomFilter = new BloomKFilter(1500);
@@ -56,11 +53,12 @@ BloomKFilter.serialize(byteArrayOutputStream, bloomFilter);
 String base64Serialized = Base64.encodeBase64String(byteArrayOutputStream.toByteArray());
 ```
 
-This string can then be used in the native or SQL Druid query.
+You can then use the Base64 encoded string in JSON-based or SQL-based queries in Druid.
 
-## Filtering queries with a Bloom Filter
+## Filter queries with a Bloom filter
+
+### JSON specification
 
-### JSON Specification of Bloom Filter
 ```json
 {
   "type" : "bloom",
@@ -70,50 +68,46 @@ This string can then be used in the native or SQL Druid query.
 }
 ```
 
-|Property                 |Description                   |required?                           |
-|-------------------------|------------------------------|----------------------------------|
-|`type`                   |Filter Type. Should always be `bloom`|yes|
-|`dimension`              |The dimension to filter over. | yes |
-|`bloomKFilter`           |Base64 encoded Binary representation of `org.apache.hive.common.util.BloomKFilter`| yes |
-|`extractionFn`|[Extraction function](../../querying/dimensionspecs.md#extraction-functions) to apply to the dimension values |no|
-
+|Property|Description|Required|
+|--------|-----------|--------|
+|`type`|Filter type. Set to `bloom`.|Yes|
+|`dimension`|Dimension to filter over.|Yes|
+|`bloomKFilter`|Base64 encoded binary representation of `org.apache.hive.common.util.BloomKFilter`.|Yes|
+|`extractionFn`|[Extraction function](../../querying/dimensionspecs.md#extraction-functions) to apply to the dimension values.|No|
 
-### Serialized Format for BloomKFilter
+### Serialized format for BloomKFilter
 
- Serialized BloomKFilter format:
+Serialized BloomKFilter format:
 
- - 1 byte for the number of hash functions.
- - 1 big endian int(That is how OutputStream works) for the number of longs in the bitset
- - big endian longs in the BloomKFilter bitset
+- 1 byte for the number of hash functions.
+- 1 big-endian integer for the number of longs in the bitset.
+- Big-endian longs in the BloomKFilter bitset.
 
-Note: `org.apache.hive.common.util.BloomKFilter` provides a serialize method which can be used to serialize bloom filters to outputStream.
+`org.apache.hive.common.util.BloomKFilter` provides a method to serialize Bloom filters to `outputStream`.
 
-### Filtering SQL Queries
+### Filter SQL queries
 
-Bloom filters can be used in SQL `WHERE` clauses via the `bloom_filter_test` operator:
+You can use Bloom filters in SQL `WHERE` clauses with the `bloom_filter_test` operator:
 
 ```sql
 SELECT COUNT(*) FROM druid.foo WHERE bloom_filter_test(<expr>, '<serialized_bytes_for_BloomKFilter>')
 ```
 
-### Expression and Virtual Column Support
+### Expression and virtual column support
 
-The bloom filter extension also adds a bloom filter [Druid expression](../../querying/math-expr.md) which shares syntax
+The Bloom filter extension also adds a Bloom filter [Druid expression](../../querying/math-expr.md) which shares syntax
 with the SQL operator.
 
 ```sql
 bloom_filter_test(<expr>, '<serialized_bytes_for_BloomKFilter>')
 ```
 
-## Bloom Filter Query Aggregator
+## Bloom filter query aggregator
 
-Input for a `bloomKFilter` can also be created from a druid query with the `bloom` aggregator. Note that it is very
-important to set a reasonable value for the `maxNumEntries` parameter, which is the maximum number of distinct entries
-that the bloom filter can represent without increasing the false positive rate. It may be worth performing a query using
-one of the unique count sketches to calculate the value for this parameter in order to build a bloom filter appropriate
-for the query.
+You can create an input for a `BloomKFilter` from a Druid query with the `bloom` aggregator. Make sure to set a reasonable value for the `maxNumEntries` parameter to specify the maximum number of distinct entries that the Bloom filter can represent without increasing the false positive rate. Try performing a query using
+one of the unique count sketches to calculate the value for this parameter to build a Bloom filter appropriate for the query.
 
-### JSON Specification of Bloom Filter Aggregator
+### JSON specification
 
 ```json
 {
@@ -124,15 +118,17 @@ for the query.
     }
 ```
 
-|Property                 |Description                   |required?                           |
-|-------------------------|------------------------------|----------------------------------|
-|`type`                   |Aggregator Type. Should always be `bloom`|yes|
-|`name`                   |Output field name |yes|
-|`field`                  |[DimensionSpec](../../querying/dimensionspecs.md) to add to `org.apache.hive.common.util.BloomKFilter` | yes |
-|`maxNumEntries`          |Maximum number of distinct values supported by `org.apache.hive.common.util.BloomKFilter`, default `1500`| no |
+|Property|Description|Required|
+|--------|-----------|--------|
+|`type`|Aggregator type. Set to `bloom`.|Yes|
+|`name`|Output field name.|Yes|
+|`field`|[DimensionSpec](../../querying/dimensionspecs.md) to add to `org.apache.hive.common.util.BloomKFilter`.|Yes|
+|`maxNumEntries`|Maximum number of distinct values supported by `org.apache.hive.common.util.BloomKFilter`. Defaults to `1500`.|No|
 
 ### Example
 
+The following example shows a timeseries query object with a `bloom` aggregator:
+
 ```json
 {
   "queryType": "timeseries",
@@ -154,25 +150,26 @@ for the query.
 }
 ```
 
-response
+Example response:
 
 ```json
-[{"timestamp":"2015-09-12T00:00:00.000Z","result":{"userBloom":"BAAAJhAAAA..."}}]
+[
+  {
+    "timestamp":"2015-09-12T00:00:00.000Z",
+    "result":{"userBloom":"BAAAJhAAAA..."}
+  }
+]
 ```
 
-These values can then be set in the filter specification described above.
-
-Ordering results by a bloom filter aggregator, for example in a TopN query, will perform a comparatively expensive
-linear scan _of the filter itself_ to count the number of set bits as a means of approximating how many items have been
-added to the set. As such, ordering by an alternate aggregation is recommended if possible.
+We recommend ordering by an alternative aggregation method instead of ordering results by a Bloom filter aggregator.
+Ordering results by a Bloom filter aggregator can be resource-intensive because Druid performs an expensive linear scan of the filter to approximate the count of items added to the set by counting the number of set bits. 
 
+### SQL Bloom filter aggregator
 
-### SQL Bloom Filter Aggregator
-Bloom filters can be computed in SQL expressions with the `bloom_filter` aggregator:
+You can compute Bloom filters in SQL expressions with the BLOOM_FILTER aggregator. For example:
 
 ```sql
 SELECT BLOOM_FILTER(<expression>, <max number of entries>) FROM druid.foo WHERE dim2 = 'abc'
 ```
 
-but requires the setting `druid.sql.planner.serializeComplexValues` to be set to `true`. Bloom filter results in a SQL
- response are serialized into a base64 string, which can then be used in subsequent queries as a filter.
+Druid serializes Bloom filter results in a SQL response into a Base64 string. You can use the resulting string in subsequent queries as a filter.
diff --git a/website/.spelling b/website/.spelling
index 9a9b2eb712d..1711a76fea2 100644
--- a/website/.spelling
+++ b/website/.spelling
@@ -117,6 +117,7 @@ Guice
 HDFS
 HLL
 HashSet
+HashSets
 Homebrew
 html
 HyperLogLog


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org
