JingsongLi opened a new pull request, #8258:
URL: https://github.com/apache/paimon/pull/8258
## Summary
Add first-phase Spark SQL support for searching across multiple vector
columns in a single query. The implementation fans out to the existing
per-column global vector indexes and fuses the scored row ids before the
normal Paimon scan, so it requires no index format changes.
## Changes
- Add `multi_vector_search` table-valued function with query-map syntax,
final limit, and optional fusion options.
- Add multi-vector predicate objects and fusion utilities supporting `rrf`
and `weighted_score` with optional per-column weights.
- Extend Spark scan plumbing so a `VectorSearchTable` can carry either a
single vector search or a multi-vector search.
- Reuse existing vector index builders for each route, including
partition/data prefilters and query-time vector index options.
- Add parser/unit coverage and an end-to-end Spark SQL test that builds two
vector indexes and queries both columns.
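
For readers unfamiliar with score fusion, the `weighted_score` option with per-column weights can be pictured as a weighted sum over the per-column hits. The Python sketch below is illustrative only and is not the code in this PR: the function name `weighted_score_fuse`, the input shape, the default weight of 1.0, and the assumption that scores are comparable across columns (higher is better) are all hypothetical.

```python
from collections import defaultdict


def weighted_score_fuse(routes, weights=None, limit=10):
    """Fuse per-column (row_id, score) hits by a weighted sum of scores.

    routes:  dict mapping column name -> list of (row_id, score) pairs.
    weights: optional dict mapping column name -> weight.
             Assumption: a column without an explicit weight uses 1.0.
    limit:   number of fused rows to return.
    """
    weights = weights or {}
    fused = defaultdict(float)
    for column, hits in routes.items():
        w = weights.get(column, 1.0)
        for row_id, score in hits:
            fused[row_id] += w * score
    # Highest fused score first; break ties by row id for determinism.
    ranked = sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:limit]


routes = {
    "title_vec": [(1, 0.9), (2, 0.25)],
    "body_vec": [(2, 0.8), (3, 0.4)],
}
unweighted_ids = [row_id for row_id, _ in weighted_score_fuse(routes, limit=2)]
weighted_ids = [
    row_id
    for row_id, _ in weighted_score_fuse(routes, {"title_vec": 2.0}, limit=2)
]
```

In this toy input, doubling the weight of `title_vec` moves row 1 ahead of row 2, which is the kind of effect per-column weights are meant to give.
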
Example:
```sql
SELECT id, __paimon_vector_search_score
FROM multi_vector_search(
  'T',
  map(
    'title_vec', array(1.0f, 0.0f),
    'body_vec', array(0.0f, 1.0f)),
  2,
  map('fusion', 'rrf', 'route_limit', '2'))
```
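
To make the roles of `route_limit`, `rrf`, and the final limit in the query concrete, here is a minimal Python sketch of reciprocal rank fusion. It is not the PR's implementation: the function name `rrf_fuse`, the input shape, and the constant `k = 60` (a common default in RRF write-ups) are assumptions, and the PR description does not state which constant Paimon uses.

```python
from collections import defaultdict


def rrf_fuse(routes, limit, route_limit=None, k=60):
    """Fuse per-column ranked row ids with reciprocal rank fusion.

    routes:      dict mapping column name -> row ids, best match first.
    limit:       final number of fused rows to return.
    route_limit: optional cap on how many hits each column contributes.
    k:           smoothing constant (assumed value, not taken from the PR).
    """
    fused = defaultdict(float)
    for hits in routes.values():
        # A slice with None as the stop keeps the whole list.
        for rank, row_id in enumerate(hits[:route_limit], start=1):
            fused[row_id] += 1.0 / (k + rank)
    # Highest fused score first; break ties by row id for determinism.
    ranked = sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:limit]


# Hypothetical per-column results, mirroring the two routes in the query.
routes = {
    "title_vec": [10, 11, 12],
    "body_vec": [11, 13, 10],
}
top_ids = [row_id for row_id, _ in rrf_fuse(routes, limit=2, route_limit=2)]
```

With `route_limit=2`, row 11 appears in both truncated routes and so ranks first, while row 10 appears only once at rank 1 and comes second.
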
## Testing
- [x] `mvn -pl paimon-common -Pfast-build -Dtest=MultiVectorSearchFusionTest test`
- [x] `mvn -pl paimon-spark/paimon-spark-common -am -Pfast-build -DfailIfNoTests=false -DwildcardSuites=org.apache.paimon.spark.catalyst.plans.logical.VectorSearchQueryTest -Dtest=none test`
- [x] `mvn -Pspark3 -pl :paimon-spark-common_2.12,:paimon-spark3-common_2.12,:paimon-spark-ut_2.12 -am -Pfast-build -DfailIfNoTests=false -DwildcardSuites=org.apache.paimon.spark.sql.MultiVectorSearchTest -Dtest=none test`
- [x] `mvn -Pspark3 -pl :paimon-spark-3.2_2.12,:paimon-spark-3.3_2.12 -am -Pfast-build -DskipTests compile`
- [x] `git diff --check`