xuzifu666 opened a new pull request, #7927:
URL: https://github.com/apache/paimon/pull/7927
### Purpose
Currently Paimon not support N-gram file index, so there is room for
improvement in scenarios involving prefix and suffix queries.
Let me briefly explain the principles and workflow of the n-gram file index
within this PR:
```
┌─────────────────────────────────────────────────────────────────────────────────┐
│ 1. Overall Architecture (Integration with Paimon FileIndexer
Framework) │
└─────────────────────────────────────────────────────────────────────────────────┘
FileIndexer Interface
│
┌───────────────┼───────────────┐
│ │ │
BloomFilter Bitmap N-gram ⭐
Index Index Index
(equality) (equality) (prefix/suffix)
N-gram File Index
│
┌──────────────────┼──────────────────┐
│ │ │
Writer Factory Reader
(Build) (SPI Creation) (Query Filter)
│ │ │
▼ ▼ ▼
Write Data → NgramFileIndex → Query Filter
Generate N-gram (Core Impl) Apply Predicates
Store HashSet gram_size param REMAIN/SKIP
┌─────────────────────────────────────────────────────────────────────────────────┐
│ 2. Index Build Process (Writing Phase)
│
└─────────────────────────────────────────────────────────────────────────────────┘
Input Rows
│
▼
┌──────────────────────┐
│ write("hello") │
│ write("world") │
└────────────┬─────────┘
│
▼
┌──────────────────────────────────────┐
│ 1. BinaryString → String conversion │
│ 2. Extract N-grams │
│ "hello" → {he, el, ll, lo} │
│ "world" → {wo, or, rl, ld} │
│ 3. Add to HashSet │
└────────────┬─────────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ Final N-gram Set: │
│ {he, el, ll, lo, wo, or, rl, ld} │
│ │
│ Size: 680 bytes (100K records) │
│ Compression ratio: 0.03% │
└────────────┬─────────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ Serialization Format: │
│ [4B gramSize][4B setSize] │
│ [2B len1][N bytes token1] │
│ [2B len2][N bytes token2] │
│ ... │
└────────────┬─────────────────────────┘
│
▼
Index Bytes
(Written to file)
┌─────────────────────────────────────────────────────────────────────────────────┐
│ 3. Query Execution Flow (Reading & Filtering Phase)
│
└─────────────────────────────────────────────────────────────────────────────────┘
SQL Query
│
├─ LIKE 'he%'
├─ LIKE '%lo'
├─ LIKE '%ll%'
└─ = 'hello'
▼
┌────────────────────────────────┐
│ Predicate Optimization │
│ LIKE 'prefix%' │
│ → StartsWith("prefix") │
│ LIKE '%suffix' │
│ → EndsWith("suffix") │
│ LIKE '%middle%' │
│ → Contains("middle") │
└────────────┬───────────────────┘
│
▼
┌──────────────────────────────────────┐
│ FileIndexPredicate.evaluate() │
│ Iterate over each data file │
└────────────┬─────────────────────--──┘
│
▼
┌──────────────────────────────────────┐
│ visitStartsWith(fieldRef, "he") │
│ │
│ 1. Get query pattern: "he" │
│ 2. Generate N-grams: {he} │
│ 3. Check each against index set │
│ │
│ Check "he" ∈ {he,el,ll,lo,...}? │
└────────────┬─────────────────────--──┘
│
┌────────┴────────┐
│ │
▼ ▼
YES NO
│ │
▼ ▼
┌──────────────┐ ┌──────────────┐
│ REMAIN │ │ SKIP │
│ File might │ │ File cannot │
│ contain data │ │ contain data │
│ (scan rows) │ │ (skip file) │
└──────────────┘ └──────────────┘
│ │
└────────┬────────┘
│
▼
Merge results & row-level scan
┌─────────────────────────────────────────────────────────────────────────────────┐
│ 4. Filter Decision Logic (Decision Tree)
│
└─────────────────────────────────────────────────────────────────────────────────┘
Query pattern: pattern
│
▼
┌──────────────────────────────┐
│ pattern == null? │ YES ──► REMAIN (conservative)
│ pattern.isEmpty()? │ YES ──► REMAIN (conservative)
│ pattern.length < gramSize? │ YES ──► REMAIN (cannot judge)
└──────────────┬───────────────┘
│ NO
▼
┌──────────────────────────────┐
│ FOR i = 0 TO pattern.length │
│ ngram = pattern[i:i+g] │
│ IF ngram ∉ ngramSet: │
│ RETURN SKIP │ Early exit (99% case)
│ RETURN REMAIN │
└──────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────────┐
│ 5. Data Flow Diagram (From Data to Filter Result)
│
└─────────────────────────────────────────────────────────────────────────────────┘
Input Data (100K rows)
│
▼
┌──────────────────────────┐
│ NgramFileIndex.Writer │ (38 ms)
│ Build index │ 2,631 rows/ms
└──────────────┬───────────┘
│
▼
┌──────────────────────────┐
│ Index Bytes (680 bytes) │
│ {N-gram set serialized} │
└──────────────┬───────────┘
│
┌──────────┴──────────┐
│ │
▼ ▼
File 1 File 1000
Index segment Index segment
│ │
▼ (25 µs) ▼ (25 µs)
┌─────────────┐ ┌─────────────┐
│ visitXxx() │ │ visitXxx() │
│ REMAIN/SKIP │ │ REMAIN/SKIP │
└─────────────┘ └─────────────┘
│ │
└──────────┬──────────┘
│
▼
┌──────────────────────────────┐
│ File-level filter result │
│ - REMAIN: 100 files │
│ - SKIP: 900 files │
│ Skipped 900/1000 files │
└──────────────┬───────────────┘
│
▼
┌──────────────────────────────┐
│ Row-level scan (REMAIN only) │
│ 100 files × 100K rows │
│ = 10M rows (vs 100M without) │
│ Reduced 90% │
└──────────────────────────────┘
```
benchmark test result:
```
┌────────────────────────────────────────────────────────────────────────────────┐
│ REAL-WORLD PERFORMANCE GAINS (Scenario: Query 1,000 files, 100K rows
each) │
├────────────────────────────────────────────────────────────────────────────────┤
│
│
│ No Index With N-gram Index
│
│ ─────────────────────────────────────
───────────────────────────────── │
│ • Scan: 1,000 files × 100K rows • Index: 1,000 × 25µs = 25ms
│
│ • Total: 100 million rows • Scan: 100 files × 100K rows
│
│ • Latency: ~100 ms • Total: 10 million rows
│
│ • Files Scanned: 100% • Latency: ~26 ms
│
│ • Files Scanned: 10%
│
│
│
│ IMPROVEMENT: 74% faster | 90% fewer rows scanned | 99% I/O reduction
│
│
│
└────────────────────────────────────────────────────────────────────────────────┘
```
**The current solution does not employ a Bloom filter, primarily to avoid
the issue of false positives.**
### Tests
NgramFileIndexSimpleTest.java
NgramFileIndexTest.java
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]