Re: [PR] Enhance DictionaryFilter with map-based lookups for O(1) performance [skywalking-banyandb]

via GitHub Sat, 29 Nov 2025 05:42:51 -0800


Copilot commented on code in PR #872:
URL: 
https://github.com/apache/skywalking-banyandb/pull/872#discussion_r2573020240



##########
pkg/filter/dictionary_filter.go:
##########
@@ -18,51 +18,61 @@
 package filter
 
 import (
-       "bytes"
-
+       "github.com/apache/skywalking-banyandb/pkg/convert"
        pbv1 "github.com/apache/skywalking-banyandb/pkg/pb/v1"
 )
 
 // DictionaryFilter is a filter implementation backed by a dictionary.
+// It uses a map-based lookup for O(1) performance instead of O(n) linear 
search.
 type DictionaryFilter struct {
+       valueSet  map[string]struct{}
        values    [][]byte
        valueType pbv1.ValueType
 }
 
 // NewDictionaryFilter creates a new dictionary filter with the given values.
 func NewDictionaryFilter(values [][]byte) *DictionaryFilter {
-       return &DictionaryFilter{
+       df := &DictionaryFilter{
                values: values,
        }
+       df.buildValueSet()
+       return df
 }
 
 // MightContain checks if an item is in the dictionary.
+// For non-array types, it uses O(1) map lookup instead of O(n) linear search.
 func (df *DictionaryFilter) MightContain(item []byte) bool {
        if df.valueType == pbv1.ValueTypeStrArr || df.valueType == 
pbv1.ValueTypeInt64Arr {
-               for _, serializedArray := range df.values {
-                       if containElement(serializedArray, item, df.valueType) {
-                               return true
-                       }
+               // For array types, check if the item exists in the 
pre-computed element set
+               if df.valueSet != nil {
+                       _, exists := df.valueSet[convert.BytesToString(item)]
+                       return exists
                }
                return false
        }
 
-       for _, v := range df.values {
-               if bytes.Equal(v, item) {
-                       return true
-               }
+       // For non-array types, use O(1) map lookup
+       if df.valueSet != nil {
+               _, exists := df.valueSet[convert.BytesToString(item)]
+               return exists
        }
        return false
 }
 
-// SetValues sets the dictionary values.
+// SetValues sets the dictionary values and builds the lookup set.
 func (df *DictionaryFilter) SetValues(values [][]byte) {
        df.values = values
+       df.buildValueSet()
 }
 
 // SetValueType sets the value type for the dictionary filter.
+// For array types, it rebuilds the lookup set by extracting elements from 
serialized arrays.
 func (df *DictionaryFilter) SetValueType(valueType pbv1.ValueType) {
        df.valueType = valueType
+       // Rebuild the set for array types since elements need to be extracted
+       if valueType == pbv1.ValueTypeStrArr || valueType == 
pbv1.ValueTypeInt64Arr {
+               df.buildValueSet()
+       }

Review Comment:
   The `SetValueType` method rebuilds the value set only for array types, but 
when switching from an array type to a non-array type (or vice versa), the map 
will contain incorrect data. For example, if you first call 
`SetValueType(ValueTypeStrArr)` which extracts elements, then call 
`SetValueType(ValueTypeUnknown)`, the map still contains the extracted elements 
instead of the raw values. The method should rebuild the value set for all 
value type changes, not just for array types.
   ```suggestion
        df.buildValueSet()
   ```



##########
pkg/filter/dictionary_filter.go:
##########
@@ -72,33 +82,43 @@ func (df *DictionaryFilter) Reset() {
        }
        df.values = df.values[:0]
        df.valueType = pbv1.ValueTypeUnknown
+       clear(df.valueSet)
 }
 
-func containElement(serializedArray []byte, element []byte, valueType 
pbv1.ValueType) bool {
-       if len(serializedArray) == 0 {
-               return false
+// buildValueSet builds a map-based lookup set from the dictionary values.
+// For non-array types, values are added directly.
+// For array types, elements are extracted from serialized arrays.
+func (df *DictionaryFilter) buildValueSet() {
+       if df.valueSet == nil {
+               df.valueSet = make(map[string]struct{}, len(df.values))
        }

Review Comment:
   The `buildValueSet` method does not clear the existing map before rebuilding 
it. When called from `SetValues()` or `SetValueType()`, this will accumulate 
old values in the map, leading to incorrect `MightContain` results. Before 
adding new values, the map should be cleared with `clear(df.valueSet)` or a new 
map should be created.



##########
pkg/filter/dictionary_filter.go:
##########
@@ -72,33 +82,43 @@ func (df *DictionaryFilter) Reset() {
        }
        df.values = df.values[:0]
        df.valueType = pbv1.ValueTypeUnknown
+       clear(df.valueSet)
 }
 
-func containElement(serializedArray []byte, element []byte, valueType 
pbv1.ValueType) bool {
-       if len(serializedArray) == 0 {
-               return false
+// buildValueSet builds a map-based lookup set from the dictionary values.
+// For non-array types, values are added directly.
+// For array types, elements are extracted from serialized arrays.
+func (df *DictionaryFilter) buildValueSet() {
+       if df.valueSet == nil {
+               df.valueSet = make(map[string]struct{}, len(df.values))
        }
-       if valueType == pbv1.ValueTypeInt64Arr {
-               if len(element) != 8 {
-                       return false
-               }
-               for i := 0; i < len(serializedArray); i += 8 {
-                       if i+8 > len(serializedArray) {
-                               break
-                       }
-                       if bytes.Equal(serializedArray[i:i+8], element) {
-                               return true
+
+       if df.valueType == pbv1.ValueTypeInt64Arr {
+               // Extract int64 elements from serialized arrays
+               for _, serializedArray := range df.values {
+                       for i := 0; i+8 <= len(serializedArray); i += 8 {
+                               
df.valueSet[convert.BytesToString(serializedArray[i:i+8])] = struct{}{}
                        }
                }
-               return false
+               return
        }
-       if valueType == pbv1.ValueTypeStrArr {
-               return containString(serializedArray, element)
+
+       if df.valueType == pbv1.ValueTypeStrArr {
+               // Extract string elements from serialized arrays
+               for _, serializedArray := range df.values {
+                       extractStrings(serializedArray, df.valueSet)
+               }
+               return
+       }
+
+       // For non-array types, add values directly
+       for _, v := range df.values {
+               df.valueSet[convert.BytesToString(v)] = struct{}{}
        }

Review Comment:
   Using `convert.BytesToString` with `unsafe` for map keys is problematic when 
the source byte slices can be modified. The `BytesToString` function uses 
`unsafe.String` which doesn't copy data, so if the underlying byte slices in 
`df.values` are modified after being added to the filter, the map keys will 
reference corrupted memory. Since this is a public API and there's no 
documented contract preventing callers from modifying the byte slices, this 
could lead to incorrect lookups or data corruption. Consider using 
`string(...)` for safe string conversion when creating map keys, or document 
that callers must not modify byte slices after passing them to the filter.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Enhance DictionaryFilter with map-based lookups for O(1) performance [skywalking-banyandb]

Reply via email to