maedhroz commented on code in PR #4353: URL: https://github.com/apache/cassandra/pull/4353#discussion_r2677003649
########## src/java/org/apache/cassandra/index/sai/disk/v1/vector/BruteForceRowIdIterator.java: ########## @@ -0,0 +1,117 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.cassandra.index.sai.disk.v1.vector; + +import io.github.jbellis.jvector.graph.GraphIndex; +import io.github.jbellis.jvector.graph.NeighborQueue; +import io.github.jbellis.jvector.graph.NeighborSimilarity; +import org.apache.cassandra.io.util.FileUtils; +import org.apache.cassandra.utils.AbstractIterator; + + +/** + * An iterator over {@link RowIdWithScore} that lazily consumes from a {@link NeighborQueue} of approximate scores. + * <p> + * The idea is that we maintain the same level of accuracy as we would get from a graph search, by re-ranking the top + * `k` best approximate scores at a time with the full resolution vectors to return the top `limit`. + * <p> + * For example, suppose that limit=3 and k=5 and we have ten elements. After our first re-ranking batch, we have + * ABDEF????? + * We will return A, B, and D; if more elements are requested, we will re-rank another 5 (so three more, including + * the two remaining from the first batch). Here we uncover C, G, and H, and order them appropriately: + * CEFGH?? + * This illustrates that, also like a graph search, we only guarantee ordering of results within a re-ranking batch, + * not globally. + * <p> + * Note that we deliberately do not fetch new items from the approximate list until the first batch of `limit`-many + * is consumed. We do this because we expect that most often the first limit-many will pass the final verification + * and only query more if some didn't (e.g. because the vector was deleted in a newer sstable). + * <p> + * As an implementation detail, we use a heap to maintain state rather than a List and sorting. + */ +public class BruteForceRowIdIterator extends AbstractIterator<RowIdWithScore> Review Comment: nit: One thing I like to do with classes that have some non-trivial state like this is throw the `@NotThreadSafe` annotation on the class (when it's clear they are meant to be used by a single thread). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]

