Copilot commented on code in PR #849:
URL: https://github.com/apache/skywalking-banyandb/pull/849#discussion_r2538664882


##########
pkg/filter/dictionary_filter.go:
##########
@@ -0,0 +1,57 @@
+// Licensed to Apache Software Foundation (ASF) under one or more contributor
+// license agreements. See the NOTICE file distributed with
+// this work for additional information regarding copyright
+// ownership. Apache Software Foundation (ASF) licenses this file to you under
+// the Apache License, Version 2.0 (the "License"); you may
+// not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+package filter
+
+import (
+       "bytes"
+)
+
+// DictionaryFilter is a filter implementation backed by a dictionary.
+type DictionaryFilter struct {
+       values [][]byte
+}
+
+// NewDictionaryFilter creates a new dictionary filter with the given values.
+func NewDictionaryFilter(values [][]byte) *DictionaryFilter {
+       return &DictionaryFilter{
+               values: values,
+       }
+}
+
+// MightContain checks if an item is in the dictionary.
+func (df *DictionaryFilter) MightContain(item []byte) bool {
+       for _, v := range df.values {
+               if bytes.Equal(v, item) {
+                       return true
+               }
+       }
+       return false
+}

Review Comment:
   The MightContain method performs a linear O(n) scan over all dictionary
values. Since dictionary encoding is capped at 256 unique values, each lookup
can cost up to 256 byte-slice comparisons. Consider backing the filter with a
map[string]struct{} for O(1) average-case lookups, especially since this filter
is used during query execution where performance is critical.
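
   A minimal sketch of the map-backed variant suggested above, assuming the same
package name and constructor signature as in this PR; the set keys are string
copies of the input values, made once at construction time rather than on every
lookup:
   ```go
package filter

// DictionaryFilter is a filter implementation backed by a dictionary,
// stored here as a set for O(1) average-case membership checks.
type DictionaryFilter struct {
        values map[string]struct{}
}

// NewDictionaryFilter creates a new dictionary filter with the given values.
func NewDictionaryFilter(values [][]byte) *DictionaryFilter {
        set := make(map[string]struct{}, len(values))
        for _, v := range values {
                // string(v) copies the bytes, so later mutation of the input
                // slices cannot corrupt the set.
                set[string(v)] = struct{}{}
        }
        return &DictionaryFilter{values: set}
}

// MightContain checks if an item is in the dictionary.
func (df *DictionaryFilter) MightContain(item []byte) bool {
        // The string(item) conversion inside a map index expression is
        // typically optimized by the Go compiler to avoid an allocation.
        _, ok := df.values[string(item)]
        return ok
}
   ```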



##########
banyand/stream/tag_filter.go:
##########
@@ -113,19 +140,33 @@ func (tff tagFamilyFilter) unmarshal(tagFamilyMetadataBlock *dataBlock, metaRead
                logger.Panicf("%s: cannot unmarshal tagFamilyMetadata: %v", metaReader.Path(), err)
        }
        for _, tm := range tfm.tagMetadata {
-               if tm.filterBlock.size == 0 {
-                       continue
-               }
-               bb.Buf = pkgbytes.ResizeExact(bb.Buf[:0], int(tm.filterBlock.size))
-               fs.MustReadData(filterReader, int64(tm.filterBlock.offset), bb.Buf)
-               bf := generateBloomFilter()
-               bf = decodeBloomFilter(bb.Buf, bf)
                tf := generateTagFilter()
-               tf.filter = bf
                if tm.valueType == pbv1.ValueTypeInt64 {
                        tf.min = tm.min
                        tf.max = tm.max
                }
+               if tm.filterBlock.size > 0 {
+                       bb.Buf = pkgbytes.ResizeExact(bb.Buf[:0], int(tm.filterBlock.size))
+                       fs.MustReadData(filterReader, int64(tm.filterBlock.offset), bb.Buf)
+                       bf := generateBloomFilter()
+                       bf = decodeBloomFilter(bb.Buf, bf)
+                       tf.filter = bf
+               } else {
+                       encodingTypeBuf := make([]byte, 1)
+                       fs.MustReadData(tagValueReader, int64(tm.offset), encodingTypeBuf)
+                       encodingType := encoding.EncodeType(encodingTypeBuf[0])
+                       if encodingType == encoding.EncodeTypeDictionary {
+                               bb.Buf = pkgbytes.ResizeExact(bb.Buf[:0], int(tm.size))
+                               fs.MustReadData(tagValueReader, int64(tm.offset), bb.Buf)

Review Comment:
   The code reads the encoding type byte separately (line 156), then reads the 
full tag value block including the encoding type byte again (line 160). This 
results in reading the first byte twice. Consider reading the full block once 
and checking the first byte, which would eliminate one file read operation per 
tag.
   ```suggestion
                        bb.Buf = pkgbytes.ResizeExact(bb.Buf[:0], int(tm.size))
                        fs.MustReadData(tagValueReader, int64(tm.offset), bb.Buf)
                        encodingType := encoding.EncodeType(bb.Buf[0])
                        if encodingType == encoding.EncodeTypeDictionary {
   ```
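
   For illustration, a self-contained sketch of the read-once pattern the
suggestion applies, using a plain os.File in place of the banyand fs reader; the
file name, offset, block size, and the numeric encode-type value below are all
made up for the example:
   ```go
package main

import (
        "fmt"
        "os"
)

// encodeTypeDictionary is a placeholder constant for this sketch; the real
// value lives in the banyand encoding package.
const encodeTypeDictionary byte = 1

func main() {
        f, err := os.Open("tag_values.bin") // hypothetical block file
        if err != nil {
                fmt.Println("open:", err)
                return
        }
        defer f.Close()

        const offset, size = 0, 64 // hypothetical block location and length
        buf := make([]byte, size)
        // A single positioned read fetches the encode-type byte and the
        // payload that follows it together.
        if _, err := f.ReadAt(buf, offset); err != nil {
                fmt.Println("read:", err)
                return
        }
        if buf[0] == encodeTypeDictionary {
                // buf already holds the whole block, so decoding the dictionary
                // payload needs no second read.
                fmt.Printf("dictionary block with %d payload bytes\n", len(buf)-1)
        }
}
   ```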


