[PR] [common] Introduce N-gram file index for query [paimon]

via GitHub Thu, 21 May 2026 07:22:43 -0700


xuzifu666 opened a new pull request, #7927:
URL: https://github.com/apache/paimon/pull/7927


   ### Purpose
   Currently Paimon not support N-gram file index, so there is room for 
improvement in scenarios involving prefix and suffix queries.
   Let me briefly explain the principles and workflow of the n-gram file index 
within this PR：
   
   ```
   
┌─────────────────────────────────────────────────────────────────────────────────┐
      │  1. Overall Architecture (Integration with Paimon FileIndexer 
Framework)        │
      
└─────────────────────────────────────────────────────────────────────────────────┘
   
                                  FileIndexer Interface
                                          │
                          ┌───────────────┼───────────────┐
                          │               │               │
                  BloomFilter         Bitmap          N-gram ⭐
                     Index             Index           Index
                   (equality)        (equality)    (prefix/suffix)
   
                          N-gram File Index
                                 │
              ┌──────────────────┼──────────────────┐
              │                  │                  │
          Writer           Factory            Reader
         (Build)        (SPI Creation)       (Query Filter)
              │                  │                  │
              ▼                  ▼                  ▼
   
         Write Data  →   NgramFileIndex    →   Query Filter
         Generate N-gram  (Core Impl)       Apply Predicates
         Store HashSet    gram_size param    REMAIN/SKIP
   
   
      
┌─────────────────────────────────────────────────────────────────────────────────┐
      │  2. Index Build Process (Writing Phase)                                 
        │
      
└─────────────────────────────────────────────────────────────────────────────────┘
   
                 Input Rows
                      │
                      ▼
          ┌──────────────────────┐
          │ write("hello")       │
          │ write("world")       │
          └────────────┬─────────┘
                       │
                       ▼
          ┌──────────────────────────────────────┐
          │ 1. BinaryString → String conversion  │
          │ 2. Extract N-grams                   │
          │    "hello" → {he, el, ll, lo}        │
          │    "world" → {wo, or, rl, ld}        │
          │ 3. Add to HashSet                    │
          └────────────┬─────────────────────────┘
                       │
                       ▼
          ┌──────────────────────────────────────┐
          │ Final N-gram Set:                    │
          │ {he, el, ll, lo, wo, or, rl, ld}     │
          │                                      │
          │ Size: 680 bytes (100K records)       │
          │ Compression ratio: 0.03%             │
          └────────────┬─────────────────────────┘
                       │
                       ▼
          ┌──────────────────────────────────────┐
          │ Serialization Format:                │
          │ [4B gramSize][4B setSize]            │
          │ [2B len1][N bytes token1]            │
          │ [2B len2][N bytes token2]            │
          │ ...                                  │
          └────────────┬─────────────────────────┘
                       │
                       ▼
                 Index Bytes
              (Written to file)
   
   
      
┌─────────────────────────────────────────────────────────────────────────────────┐
      │  3. Query Execution Flow (Reading & Filtering Phase)                    
        │
      
└─────────────────────────────────────────────────────────────────────────────────┘
   
          SQL Query
            │
            ├─ LIKE 'he%'
            ├─ LIKE '%lo'
            ├─ LIKE '%ll%'
            └─ = 'hello'
   
               ▼
          ┌────────────────────────────────┐
          │ Predicate Optimization         │
          │ LIKE 'prefix%'                 │
          │   → StartsWith("prefix")       │
          │ LIKE '%suffix'                 │
          │   → EndsWith("suffix")         │
          │ LIKE '%middle%'                │
          │   → Contains("middle")         │
          └────────────┬───────────────────┘
                       │
                       ▼
          ┌──────────────────────────────────────┐
          │ FileIndexPredicate.evaluate()        │
          │ Iterate over each data file          │
          └────────────┬─────────────────────--──┘
                       │
                       ▼
          ┌──────────────────────────────────────┐
          │ visitStartsWith(fieldRef, "he")      │
          │                                      │
          │ 1. Get query pattern: "he"           │
          │ 2. Generate N-grams: {he}            │
          │ 3. Check each against index set      │
          │                                      │
          │ Check "he" ∈ {he,el,ll,lo,...}?      │
          └────────────┬─────────────────────--──┘
                       │
              ┌────────┴────────┐
              │                 │
              ▼                 ▼
             YES               NO
              │                 │
              ▼                 ▼
          ┌──────────────┐  ┌──────────────┐
          │ REMAIN       │  │ SKIP         │
          │ File might   │  │ File cannot  │
          │ contain data │  │ contain data │
          │ (scan rows)  │  │ (skip file)  │
          └──────────────┘  └──────────────┘
              │                 │
              └────────┬────────┘
                       │
                       ▼
              Merge results & row-level scan
   
   
      
┌─────────────────────────────────────────────────────────────────────────────────┐
      │  4. Filter Decision Logic (Decision Tree)                               
        │
      
└─────────────────────────────────────────────────────────────────────────────────┘
   
          Query pattern: pattern
                  │
                  ▼
          ┌──────────────────────────────┐
          │ pattern == null?             │ YES ──► REMAIN (conservative)
          │ pattern.isEmpty()?           │ YES ──► REMAIN (conservative)
          │ pattern.length < gramSize?   │ YES ──► REMAIN (cannot judge)
          └──────────────┬───────────────┘
                         │ NO
                         ▼
          ┌──────────────────────────────┐
          │ FOR i = 0 TO pattern.length  │
          │     ngram = pattern[i:i+g]   │
          │     IF ngram ∉ ngramSet:     │
          │         RETURN SKIP          │ Early exit (99% case)
          │ RETURN REMAIN               │
          └──────────────────────────────┘
   
   
      
┌─────────────────────────────────────────────────────────────────────────────────┐
      │  5. Data Flow Diagram (From Data to Filter Result)                      
        │
      
└─────────────────────────────────────────────────────────────────────────────────┘
   
          Input Data (100K rows)
                  │
                  ▼
          ┌──────────────────────────┐
          │ NgramFileIndex.Writer    │ (38 ms)
          │ Build index              │ 2,631 rows/ms
          └──────────────┬───────────┘
                         │
                         ▼
          ┌──────────────────────────┐
          │ Index Bytes (680 bytes)  │
          │ {N-gram set serialized}  │
          └──────────────┬───────────┘
                         │
              ┌──────────┴──────────┐
              │                     │
              ▼                     ▼
          File 1              File 1000
          Index segment       Index segment
              │                     │
              ▼ (25 µs)            ▼ (25 µs)
          ┌─────────────┐    ┌─────────────┐
          │ visitXxx()  │    │ visitXxx()  │
          │ REMAIN/SKIP │    │ REMAIN/SKIP │
          └─────────────┘    └─────────────┘
              │                     │
              └──────────┬──────────┘
                         │
                         ▼
          ┌──────────────────────────────┐
          │ File-level filter result     │
          │ - REMAIN: 100 files          │
          │ - SKIP: 900 files            │
          │ Skipped 900/1000 files       │
          └──────────────┬───────────────┘
                         │
                         ▼
          ┌──────────────────────────────┐
          │ Row-level scan (REMAIN only) │
          │ 100 files × 100K rows        │
          │ = 10M rows (vs 100M without) │
          │ Reduced 90%                  │
          └──────────────────────────────┘
   
   ```
   
   benchmark test result：
   ```
      
┌────────────────────────────────────────────────────────────────────────────────┐
      │ REAL-WORLD PERFORMANCE GAINS (Scenario: Query 1,000 files, 100K rows 
each)   │
      
├────────────────────────────────────────────────────────────────────────────────┤
      │                                                                         
       │
      │  No Index                                 With N-gram Index             
      │
      │  ─────────────────────────────────────    
─────────────────────────────────   │
      │  • Scan: 1,000 files × 100K rows         • Index: 1,000 × 25µs = 25ms   
    │
      │  • Total: 100 million rows               • Scan: 100 files × 100K rows  
    │
      │  • Latency: ~100 ms                      • Total: 10 million rows       
    │
      │  • Files Scanned: 100%                   • Latency: ~26 ms              
   │
      │                                          • Files Scanned: 10%           
    │
      │                                                                         
       │
      │  IMPROVEMENT: 74% faster | 90% fewer rows scanned | 99% I/O reduction   
    │
      │                                                                         
       │
      
└────────────────────────────────────────────────────────────────────────────────┘
   ```
   
   **The current solution does not employ a Bloom filter, primarily to avoid 
the issue of false positives.**
   
   ### Tests
   NgramFileIndexSimpleTest.java
   NgramFileIndexTest.java


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[PR] [common] Introduce N-gram file index for query [paimon]

Reply via email to