jerry-024 commented on PR #6807:
URL: https://github.com/apache/paimon/pull/6807#issuecomment-3673396350
> @jerry-024 Vector search and future Full-text search should inherit from
> TopN, furthermore, I think the results of the predicate index should be
> integrated into the TopN like this:
>
> RangeBitmapFileIndex.java
>
> ```java
> //...
> public FileIndexResult visitTopN(TopN topN) {
>     FileIndexResult result = topN.getPredicateIndexResult();
>     RoaringBitmap32 foundSet =
>             result instanceof BitmapIndexResult ? ((BitmapIndexResult) result).get() : null;
>
>     int limit = topN.limit();
>     List<SortValue> orders = topN.orders();
>     SortValue sort = orders.get(0);
>     SortValue.NullOrdering nullOrdering = sort.nullOrdering();
>     boolean strict = orders.size() == 1;
>     if (ASCENDING.equals(sort.direction())) {
>         return new BitmapIndexResult(
>                 () -> bitmap.bottomK(limit, nullOrdering, foundSet, strict));
>     } else {
>         return new BitmapIndexResult(
>                 () -> bitmap.topK(limit, nullOrdering, foundSet, strict));
>     }
> }
> ```
>
> FileIndexReader.java
>
> ```java
> //...
> public FileIndexResult visitTopN(TopN topN) {
>     return REMAIN;
> }
>
> public FileIndexResult visitVectorSearch(VectorSearch search) {
>     return visitTopN(search);
> }
>
> public FileIndexResult visitFullTextSearch(FullTextSearch search) {
>     return visitTopN(search);
> }
> ```
>
> cc @Tan-JiaLiang @lxy-9602
@hang8929201 Thanks for the proposal. Regarding the class hierarchy, I wonder
whether it would be better to keep VectorSearch (and future full-text search)
distinct from TopN rather than inheriting from it directly. A few thoughts for
your consideration:

1. Semantics of orders vs. similarity: TopN is typically defined by explicit
   orders (SortValues). Vector search works differently: it finds the
   "nearest" matches based on distance calculations rather than sorting by
   scalar columns. If VectorSearch inherits from TopN, it ends up carrying the
   orders field, which vector queries do not strictly need and which might
   obscure the intent of the search.
2. Architecture in other systems (Lance & PG): looking at how other systems
   handle this might offer some perspective.
   - LanceDB: treats vector search as a specialized operation. The API usually
     separates the search(vector) logic from standard sorting or limiting to
     allow for vector-specific parameters (like probe counts or refinement
     steps).
   - PostgreSQL (pgvector): although it uses the ORDER BY syntax, the
     execution plan typically uses a specialized k-NN index scan, which works
     quite differently from a standard top-N heapsort.

Treating VectorSearch as an independent query type (a sibling of TopN) that
still accepts the predicate result for filtering might give us more
flexibility in the future.
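
To make the sibling design concrete, here is a minimal, self-contained sketch. Every type in it (`TopN`, `VectorSearch`, `IndexReader`, `BruteForceVectorReader`) is a hypothetical stand-in written for this comment, not Paimon's actual API, and the brute-force squared-L2 scan only stands in for a real vector index. The point is that `VectorSearch` carries a query vector and an optional predicate result, but no orders, and has its own visit method that does not delegate to `visitTopN`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: none of these types are Paimon's real classes.
public class VectorSearchSketch {

    /** Scalar TopN: a sort direction (stand-in for orders) plus a limit. */
    static final class TopN {
        final boolean ascending;
        final int limit;

        TopN(boolean ascending, int limit) {
            this.ascending = ascending;
            this.limit = limit;
        }
    }

    /** Vector search: a sibling of TopN that carries a query vector, not orders. */
    static final class VectorSearch {
        final float[] query;
        final int limit;
        final Set<Integer> candidates; // rows kept by the predicate index; null = no filter

        VectorSearch(float[] query, int limit, Set<Integer> candidates) {
            this.query = query;
            this.limit = limit;
            this.candidates = candidates;
        }
    }

    /** Each query type has its own entry point; neither delegates to the other. */
    interface IndexReader {
        List<Integer> visitTopN(TopN topN);

        List<Integer> visitVectorSearch(VectorSearch search);
    }

    /** Brute-force reader over in-memory vectors, for illustration only. */
    static final class BruteForceVectorReader implements IndexReader {
        private final float[][] vectors;

        BruteForceVectorReader(float[][] vectors) {
            this.vectors = vectors;
        }

        @Override
        public List<Integer> visitTopN(TopN topN) {
            // This index cannot answer a scalar TopN, so it keeps every row
            // (analogous to returning REMAIN).
            List<Integer> all = new ArrayList<>();
            for (int i = 0; i < vectors.length; i++) {
                all.add(i);
            }
            return all;
        }

        @Override
        public List<Integer> visitVectorSearch(VectorSearch search) {
            // Apply the predicate result first, then rank survivors by distance.
            List<Integer> rows = new ArrayList<>();
            for (int i = 0; i < vectors.length; i++) {
                if (search.candidates == null || search.candidates.contains(i)) {
                    rows.add(i);
                }
            }
            rows.sort(Comparator.comparingDouble(
                    (Integer i) -> squaredL2(vectors[i], search.query)));
            return new ArrayList<>(rows.subList(0, Math.min(search.limit, rows.size())));
        }

        private static double squaredL2(float[] a, float[] b) {
            double sum = 0;
            for (int d = 0; d < a.length; d++) {
                double diff = a[d] - b[d];
                sum += diff * diff;
            }
            return sum;
        }
    }

    public static void main(String[] args) {
        float[][] vectors = {{0f, 0f}, {1f, 1f}, {5f, 5f}, {0.1f, 0.1f}};
        BruteForceVectorReader reader = new BruteForceVectorReader(vectors);
        Set<Integer> filtered = new HashSet<>(Arrays.asList(1, 2, 3));
        System.out.println(reader.visitVectorSearch(new VectorSearch(new float[] {0f, 0f}, 2, null)));
        System.out.println(reader.visitVectorSearch(new VectorSearch(new float[] {0f, 0f}, 2, filtered)));
    }
}
```

With this shape, vector-specific parameters (a distance metric, probe counts, and so on) could later be added to `VectorSearch` without touching `TopN`, while the predicate result is still honored as a pre-filter.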