[PR] [globalindex] Support multi-column GlobalIndex framework [paimon]

via GitHub Thu, 21 May 2026 20:25:52 -0700


CrownChu opened a new pull request, #7933:
URL: https://github.com/apache/paimon/pull/7933


     Extend the GlobalIndex SPI, build path, and query path to support one index
     builder handling multiple columns (e.g. Lucene indexing title + content + tags
     together). Key changes:

     - GlobalIndexerFactory/GlobalIndexer: add List<DataField> create overloads
     - GlobalIndexMultiColumnWriter: new interface for multi-column writes
     - GlobalIndexBuilderUtils: toIndexFileMetas/createIndexWriter accept List<DataField>
     - GlobalIndexScanner: route extraFieldIds to the same reader group
     - VectorScanImpl/FullTextScanImpl: match against extraFieldIds
     - GenericIndexTopoBuilder (Flink): multi-column projection and writer dispatch
     - DefaultGlobalIndexBuilder/TopoBuilder (Spark): multi-column support
     - All single-column APIs preserved for backward compatibility
   
     Purpose
   
     Some index engines (e.g. Lucene) can build a single index over multiple
     columns, for example full-text on title and vector on embedding in the same
     index file. Previously the GlobalIndex SPI supported only one column per
     indexer: GlobalIndexerFactory.create(DataField, Options) and
     GlobalIndexSingletonWriter.write(Object). As a result, multi-column engines had
     to create a separate index file per column, losing the benefits of co-located
     search and doubling I/O.
     
     This PR adds a multi-column path through the entire stack:
   
     1. SPI layer (paimon-common): GlobalIndexerFactory.create(List<DataField>, Options)
        is a default method that falls back to the single-column create for existing
        implementations. The new GlobalIndexMultiColumnWriter interface accepts an
        InternalRow with all indexed columns projected in field order.
     2. Core index metadata (paimon-core): GlobalIndexBuilderUtils.toIndexFileMetas
        populates GlobalIndexMeta.extraFieldIds from the fields after the first.
        GlobalIndexScanner.createReaders resolves the full field list from metadata
        and passes it to the factory. Extra field IDs are registered in the indexMetas
        map so queries against any column in the group find the same reader.
     3. Query path (paimon-core): VectorScanImpl and FullTextScanImpl check
        extraFieldIds when matching index files to query columns, so a vector query
        on embedding finds an index whose primary field is title but which includes
        embedding as an extra field.
     4. Flink build (paimon-flink): GenericIndexTopoBuilder accepts List<String>
        indexColumns, projects all indexed columns plus _ROW_ID, and dispatches to
        GlobalIndexMultiColumnWriter.write(InternalRow) when more than one column is
        indexed. findMinNonIndexableRowId checks containsAll(indexColumns) for schema
        evolution safety.
     5. Spark build (paimon-spark): DefaultGlobalIndexBuilder and
        DefaultGlobalIndexTopoBuilder gain List<DataField> overloads, with
        single-column constructors delegating to the multi-column path.
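
     To illustrate the metadata and query-path behaviour in items 2 and 3, here is
     a minimal, self-contained sketch. It does not use Paimon classes: IndexMeta,
     covers, allFieldIds, and groupByField are hypothetical names invented for this
     example, and the field ids are illustrative. It only shows the idea of
     registering one index file under its primary field id and every extra field
     id, so that a lookup by any column in the group reaches the same entry.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ExtraFieldIdsSketch {

    /** Hypothetical stand-in for index file metadata: a primary field id plus extra field ids. */
    static final class IndexMeta {
        final String fileName;
        final int fieldId;
        final int[] extraFieldIds;

        IndexMeta(String fileName, int fieldId, int... extraFieldIds) {
            this.fileName = fileName;
            this.fieldId = fieldId;
            this.extraFieldIds = extraFieldIds;
        }

        /** True if the queried field is the primary field or one of the extra fields. */
        boolean covers(int queryFieldId) {
            if (fieldId == queryFieldId) {
                return true;
            }
            for (int extra : extraFieldIds) {
                if (extra == queryFieldId) {
                    return true;
                }
            }
            return false;
        }

        /** Full field list: the primary field first, then the extras in stored order. */
        List<Integer> allFieldIds() {
            List<Integer> ids = new ArrayList<>();
            ids.add(fieldId);
            for (int extra : extraFieldIds) {
                ids.add(extra);
            }
            return ids;
        }
    }

    /** Registers each index file under every field id it covers. */
    static Map<Integer, List<IndexMeta>> groupByField(List<IndexMeta> metas) {
        Map<Integer, List<IndexMeta>> byField = new HashMap<>();
        for (IndexMeta meta : metas) {
            for (int id : meta.allFieldIds()) {
                byField.computeIfAbsent(id, k -> new ArrayList<>()).add(meta);
            }
        }
        return byField;
    }

    public static void main(String[] args) {
        // Illustrative field ids: 1 = title, 2 = content, 3 = embedding.
        IndexMeta multi = new IndexMeta("index-0", 1, 3); // primary title, extra embedding
        IndexMeta single = new IndexMeta("index-1", 2);   // content only
        Map<Integer, List<IndexMeta>> byField = groupByField(Arrays.asList(multi, single));

        // A query on embedding (an extra field) resolves to the title-primary index file.
        System.out.println(byField.get(3).get(0).fileName); // prints index-0
        System.out.println(multi.covers(3));                // prints true
        System.out.println(single.covers(3));               // prints false
    }
}
```

     The sketch keys the map by field id so the lookup for a primary field and for
     an extra field returns the same object; how the real indexMetas map is keyed
     and how readers are grouped is defined by the PR's code, not by this example.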
   
     All existing single-column callers are unchanged — new APIs have default 
implementations that delegate to the original single-column methods.
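
     The backward-compatibility pattern described above can be sketched with a Java
     default method. The types below (Indexer, IndexerFactory, LegacyFactory,
     MultiColumnFactory) are hypothetical simplifications, not the actual Paimon
     interfaces, and columns are plain strings instead of DataField. The sketch
     also assumes the fallback indexes only the first column; the description does
     not say what the real default does with the remaining fields.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class MultiColumnFallbackSketch {

    /** Hypothetical stand-in for an indexer; it only records which columns it covers. */
    static final class Indexer {
        final List<String> columns;

        Indexer(List<String> columns) {
            this.columns = columns;
        }
    }

    /** Hypothetical factory SPI: one abstract single-column method, one default list overload. */
    interface IndexerFactory {
        Indexer create(String column, Map<String, String> options);

        // Assumption for this sketch: the default falls back to the first column only.
        default Indexer create(List<String> columns, Map<String, String> options) {
            if (columns.isEmpty()) {
                throw new IllegalArgumentException("at least one column is required");
            }
            return create(columns.get(0), options);
        }
    }

    /** Existing implementation: written before the list overload existed, compiles unchanged. */
    static final class LegacyFactory implements IndexerFactory {
        @Override
        public Indexer create(String column, Map<String, String> options) {
            return new Indexer(Collections.singletonList(column));
        }
    }

    /** Multi-column implementation: overrides the list overload and routes single columns to it. */
    static final class MultiColumnFactory implements IndexerFactory {
        @Override
        public Indexer create(String column, Map<String, String> options) {
            return create(Collections.singletonList(column), options);
        }

        @Override
        public Indexer create(List<String> columns, Map<String, String> options) {
            return new Indexer(columns);
        }
    }

    public static void main(String[] args) {
        List<String> columns = Arrays.asList("title", "embedding");
        Map<String, String> options = Collections.emptyMap();

        // The legacy factory silently uses the default and indexes the first column only.
        System.out.println(new LegacyFactory().create(columns, options).columns);      // prints [title]
        // The multi-column factory sees the whole list.
        System.out.println(new MultiColumnFactory().create(columns, options).columns); // prints [title, embedding]
    }
}
```

     Because the list overload has a default body, an implementation compiled
     against the old single-method interface keeps working, while engines such as
     Lucene can opt in by overriding it.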
   
     Tests
     
     - GenericIndexTopoBuilderTest: updated the findMinNonIndexableRowId call to use
       the List<String> signature.
     - paimon-lucene module (on feature-globalindex-support-multi-test branch): a full
       Lucene 9.12.0 implementation exercising the multi-column framework end-to-end:
       - LuceneGlobalIndexTest: single-column vector write/search, top-K score
         ordering (2 tests)
       - LuceneFullTextIndexTest: single-column full-text write/search, score
         verification, no-results case (3 tests)
       - LuceneMultiColumnIndexTest: multi-column (text + vector) via
         GlobalIndexerFactory.create(List<DataField>, Options) →
         GlobalIndexMultiColumnWriter.write(InternalRow), verifies both full-text and
         vector queries on the same index file; also tests the SPI discovery path via
         GlobalIndexer.create("lucene", ...) (2 tests)
       - LuceneGlobalIndexScanTest: end-to-end Paimon table tests: creates a
         FileStoreTable, writes data, builds a Lucene index, commits via
         DataIncrement.indexIncrement, queries via
         VectorSearchBuilder/FullTextSearchBuilder →
         ReadBuilder.newScan().withGlobalIndexResult(), reads back rows and asserts
         correctness (3 tests)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
