airborne12 opened a new pull request, #64229:
URL: https://github.com/apache/doris/pull/64229
### What problem does this PR solve?
Issue Number: N/A
Related PR: N/A
Problem Summary:
In storage-compute separation, querying an exact term that does **not** exist in a segment still pays the full searcher open -- open the compound reader, materialize `.tii` into memory, read `null_bitmap`, probe `.tis` -- only to discover the term has no postings. On S3 that is a wasted remote round-trip per segment.
This PR adds an optional, CLucene-compatible (no storage-format-version bump), **default-off** token-exists Bloom Filter:
- A self-describing `"tbf"` sub-file inside the compound `.idx` records which analyzed tokens exist in the segment's term dictionary, fed from the term dictionary itself (no re-tokenization, zero inconsistency). On query, an ABSENT verdict short-circuits to an empty bitmap before any searcher-open IO. The BF guarantees no false negatives, so absent -> empty is always correct; never-drop-results guardrails (A1 phrase position grouping, A2 multi-term-slot OR, A3 analyzer-signature staleness, A4 keyword path, A5 empty keyword token) fall back to the normal lookup on any uncertainty.
- An LRU cache of the parsed BF per (segment, index), so a warm absent query does zero IO.
- Query-profile observability: headline `InvertedIndexTermBfSkippedLookups` (lookups the BF short-circuited) + `InvertedIndexTermBfProbe` (denominator for hit rate) + `InvertedIndexTermBfUnavailable` (no usable tbf), plus level-2 diagnostics (cache hit/miss, cold load IO, fall-throughs).
- An env-gated fpp sweep (analysis tool, never runs in CI) used to justify keeping the default `fpp = 0.01`.
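
The absent-term short-circuit described in the list above can be illustrated with a minimal Python sketch. It is not the PR's implementation: the class `TokenBloomFilter`, the function `lookup`, the `open_searcher` callback, and the SHA-256 double hashing are all hypothetical choices made for illustration, and the real `tbf` layout and hash function are not described here. The sizing uses the standard Bloom filter formulas, under which `fpp = 0.01` works out to roughly 9.6 bits per token and 7 hash probes.

```python
import hashlib
import math


class TokenBloomFilter:
    """Toy Bloom filter over analyzed tokens (hypothetical, not the `tbf` format)."""

    def __init__(self, expected_tokens: int, fpp: float = 0.01):
        n = max(1, expected_tokens)
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(8, math.ceil(-n * math.log(fpp) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.num_bits / n * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, token: str):
        # Double hashing: derive k bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, token: str) -> None:
        for pos in self._positions(token):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def might_contain(self, token: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(token))


def lookup(term, bf, enabled, open_searcher):
    """Return the matching row ids, skipping the searcher open when the BF says absent.

    `open_searcher` stands in for the expensive path (compound reader, `.tii`, `.tis`).
    Any uncertainty (feature disabled, no usable filter) falls back to the normal lookup.
    """
    if enabled and bf is not None and not bf.might_contain(term):
        return set()  # absent verdict: empty result, no searcher-open IO
    return open_searcher(term)
```

Because the filter is populated from the same tokens the term dictionary holds, a token that was added can never be reported absent, which is what makes returning an empty result safe. A false positive only costs the normal lookup that would have happened anyway.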
Switches: index property `token_bloom_filter` (write) + BE config `enable_inverted_index_term_bf` (read, default `false`); BF cache sized by BE config `inverted_index_term_bf_cache_limit` (default `1%`).
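
As a rough illustration of the per-(segment, index) LRU cache, here is a minimal Python sketch. The class `TermBfCache`, its byte-based `capacity_bytes` limit, and the `loader` callback are assumptions for illustration only; how `inverted_index_term_bf_cache_limit` is interpreted and how the BE cache is actually built are not covered here.

```python
from collections import OrderedDict


class TermBfCache:
    """Toy LRU cache of parsed filters keyed by (segment_id, index_id). Hypothetical."""

    def __init__(self, capacity_bytes: int):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self.hits = 0
        self.misses = 0
        self._entries = OrderedDict()  # oldest entry first, most recently used last

    def get_or_load(self, segment_id, index_id, loader):
        key = (segment_id, index_id)
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        value = loader()  # cold load: the only place IO would happen
        self._insert(key, value)
        return value

    def _insert(self, key, value: bytes) -> None:
        size = len(value)
        if size > self.capacity_bytes:
            return  # too large to cache; the caller still gets the loaded value
        self._entries[key] = value
        self.used_bytes += size
        while self.used_bytes > self.capacity_bytes:
            _, evicted = self._entries.popitem(last=False)  # drop least recently used
            self.used_bytes -= len(evicted)
```

On a warm hit the `loader` is never called, which corresponds to the "warm absent query does zero IO" behavior; the `hits` and `misses` counters play the role of the cache hit/miss diagnostics.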
Measured (instrumented UT, 1M-row segment): absent `MATCH_ALL`/`MATCH_PHRASE` `read_at` 8 -> 1; present queries unchanged (84 -> 84); warm absent query 0 sub-file reads.
### Release note
Add an optional token-exists Bloom Filter for inverted indexes that fast-paths absent exact-term queries (skips the searcher open / index IO) under storage-compute separation. Opt-in via index property `token_bloom_filter` and BE config `enable_inverted_index_term_bf` (default off). No storage format change.
### Check List (For Author)
- Test
    - [x] Unit Test
- Behavior changed:
    - [x] No. <!-- opt-in, default-off; existing queries are unaffected when the property/config are not set -->
- Does this need documentation?
    - [x] Yes. <!-- a doc PR will follow once the feature graduates from default-off; a design doc is kept internally for now -->