CrownChu opened a new pull request, #7933:
URL: https://github.com/apache/paimon/pull/7933
Extend the GlobalIndex SPI, build path, and query path to support one index
builder handling multiple columns (e.g. Lucene indexing title + content + tags
together). Key changes:
- GlobalIndexerFactory/GlobalIndexer: add List<DataField> create overloads
- GlobalIndexMultiColumnWriter: new interface for multi-column writes
- GlobalIndexBuilderUtils: toIndexFileMetas/createIndexWriter accept
List<DataField>
- GlobalIndexScanner: route extraFieldIds to the same reader group
- VectorScanImpl/FullTextScanImpl: match against extraFieldIds
- GenericIndexTopoBuilder (Flink): multi-column projection and writer
dispatch
- DefaultGlobalIndexBuilder/TopoBuilder (Spark): multi-column support
- All single-column APIs preserved for backward compatibility
Purpose
Some index engines (e.g. Lucene) can build a single index over multiple
columns, for example full-text on title and vector on embedding in the same
index file. Previously the GlobalIndex SPI supported only one column
per indexer: GlobalIndexerFactory.create(DataField, Options) and
GlobalIndexSingletonWriter.write(Object). Multi-column engines therefore had
to create a separate index file per column, losing the benefits of co-located
search and repeating I/O for each additional column.
This PR adds a multi-column path through the entire stack:
1. SPI layer (paimon-common): GlobalIndexerFactory gains a
create(List<DataField>, Options) default method that falls back to the
single-column path for existing implementations. The new
GlobalIndexMultiColumnWriter interface accepts an InternalRow with all indexed
columns projected in field order.
2. Core index metadata (paimon-core):
GlobalIndexBuilderUtils.toIndexFileMetas populates
GlobalIndexMeta.extraFieldIds with every field after the first in the field
list. GlobalIndexScanner.createReaders resolves the full
field list from metadata and passes it to the factory. Extra field IDs
are also registered in the indexMetas map, so a query against any column in
the group finds the same reader.
3. Query path (paimon-core): VectorScanImpl and FullTextScanImpl check
extraFieldIds when matching index files to query columns, so a vector query on
embedding finds an index whose primary field is title but
includes embedding as an extra field.
4. Flink build (paimon-flink): GenericIndexTopoBuilder accepts
List<String> indexColumns, projects all index columns plus _ROW_ID, and
dispatches to GlobalIndexMultiColumnWriter.write(InternalRow) when more than
one column is indexed. findMinNonIndexableRowId checks
containsAll(indexColumns) for schema evolution safety.
5. Spark build (paimon-spark): DefaultGlobalIndexBuilder and
DefaultGlobalIndexTopoBuilder gain List<DataField> overloads, with
single-column constructors delegating to the multi-column path.
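The routing described in items 2 and 3 can be illustrated with a minimal Java sketch. It is not the PR's code: IndexMetaSketch and ExtraFieldRoutingSketch are hypothetical stand-ins, the field IDs and file names are made up, and the real GlobalIndexMeta and GlobalIndexScanner carry more state. The only idea shown is that an index file is registered under its primary field ID and under each extra field ID, so a lookup by any covered column reaches the same file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for index file metadata; not Paimon's GlobalIndexMeta. */
final class IndexMetaSketch {
    final String fileName;
    final int primaryFieldId;
    final int[] extraFieldIds; // empty for a single-column index

    IndexMetaSketch(String fileName, int primaryFieldId, int... extraFieldIds) {
        this.fileName = fileName;
        this.primaryFieldId = primaryFieldId;
        this.extraFieldIds = extraFieldIds;
    }

    /** Primary field first, then the extra fields in stored order. */
    List<Integer> allFieldIds() {
        List<Integer> ids = new ArrayList<>();
        ids.add(primaryFieldId);
        for (int id : extraFieldIds) {
            ids.add(id);
        }
        return ids;
    }
}

/** Hypothetical helper; assumes a plain map keyed by field ID. */
final class ExtraFieldRoutingSketch {
    /** Registers each index file under every field ID it covers. */
    static Map<Integer, List<IndexMetaSketch>> groupByFieldId(List<IndexMetaSketch> metas) {
        Map<Integer, List<IndexMetaSketch>> byField = new HashMap<>();
        for (IndexMetaSketch meta : metas) {
            for (int fieldId : meta.allFieldIds()) {
                byField.computeIfAbsent(fieldId, k -> new ArrayList<>()).add(meta);
            }
        }
        return byField;
    }

    public static void main(String[] args) {
        // Made-up field IDs: 1 = title, 2 = embedding, 3 = price.
        IndexMetaSketch multi = new IndexMetaSketch("multi-0.index", 1, 2);
        IndexMetaSketch single = new IndexMetaSketch("single-0.index", 3);
        Map<Integer, List<IndexMetaSketch>> byField =
                groupByFieldId(Arrays.asList(multi, single));
        // A lookup by the extra field (embedding) reaches the multi-column file.
        System.out.println(byField.get(2).get(0).fileName); // prints multi-0.index
    }
}
```

With this grouping, a vector query on embedding (ID 2 here) resolves to the file whose primary field is title (ID 1), which is the behaviour item 3 describes.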
All existing single-column callers are unchanged: the new APIs have default
implementations that delegate to the original single-column methods.
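The backward-compatibility pattern can be sketched in Java as a default interface method. All names below (FieldSketch, IndexerFactorySketch, LegacyFactory, MultiColumnFactory) are hypothetical, the Options parameter is omitted, and the indexer is reduced to a String for brevity. How the real default treats a list with more than one field is not stated in this description; rejecting it here is an assumption of the sketch.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical, simplified column descriptor; not Paimon's DataField. */
final class FieldSketch {
    final int id;
    final String name;

    FieldSketch(int id, String name) {
        this.id = id;
        this.name = name;
    }
}

/** Hypothetical factory with the shape described above. */
interface IndexerFactorySketch {
    /** Original single-column entry point. */
    String create(FieldSketch field);

    /**
     * New multi-column overload. The default falls back to the single-column
     * method; rejecting longer lists is an assumption of this sketch.
     */
    default String create(List<FieldSketch> fields) {
        if (fields.size() != 1) {
            throw new UnsupportedOperationException("multi-column not supported");
        }
        return create(fields.get(0));
    }
}

/** Existing implementation: untouched, inherits the default overload. */
final class LegacyFactory implements IndexerFactorySketch {
    @Override
    public String create(FieldSketch field) {
        return "single(" + field.name + ")";
    }
}

/** Multi-column-aware implementation: overrides the list overload. */
final class MultiColumnFactory implements IndexerFactorySketch {
    @Override
    public String create(FieldSketch field) {
        return create(Collections.singletonList(field));
    }

    @Override
    public String create(List<FieldSketch> fields) {
        StringBuilder sb = new StringBuilder("multi(");
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(fields.get(i).name);
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        FieldSketch title = new FieldSketch(1, "title");
        FieldSketch embedding = new FieldSketch(2, "embedding");
        System.out.println(new LegacyFactory().create(Collections.singletonList(title)));
        System.out.println(new MultiColumnFactory().create(Arrays.asList(title, embedding)));
    }
}
```

The point of the pattern is that LegacyFactory compiles and behaves as before without any change, while engines that can index several columns opt in by overriding the list overload.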
Tests
- GenericIndexTopoBuilderTest: Updated findMinNonIndexableRowId call to
use List<String> signature.
- paimon-lucene module (on feature-globalindex-support-multi-test branch):
A full Lucene 9.12.0 implementation exercising the multi-column framework
end-to-end:
- LuceneGlobalIndexTest — single-column vector write/search, top-K score
ordering (2 tests)
- LuceneFullTextIndexTest — single-column full-text write/search, score
verification, no-results case (3 tests)
- LuceneMultiColumnIndexTest — multi-column (text + vector) via
GlobalIndexerFactory.create(List<DataField>, Options) →
GlobalIndexMultiColumnWriter.write(InternalRow), verifies both full-text and
vector queries on the same index file; also tests the SPI discovery path via
GlobalIndexer.create("lucene", ...) (2 tests)
- LuceneGlobalIndexScanTest — end-to-end Paimon table tests: creates
FileStoreTable, writes data, builds Lucene index, commits via
DataIncrement.indexIncrement, queries via
VectorSearchBuilder/FullTextSearchBuilder →
ReadBuilder.newScan().withGlobalIndexResult(), reads back rows and asserts
correctness (3 tests)