Copilot commented on code in PR #17994:
URL: https://github.com/apache/pinot/pull/17994#discussion_r2998598135


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/vector/IvfFlatVectorIndexReader.java:
##########
@@ -0,0 +1,330 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.segment.index.readers.vector;
+
+import com.google.common.annotations.VisibleForTesting;
+import com.google.common.base.Preconditions;
+import java.io.DataInputStream;
+import java.io.File;
+import java.io.FileInputStream;
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.PriorityQueue;
+import 
org.apache.pinot.segment.local.segment.index.vector.IvfFlatVectorIndexCreator;
+import org.apache.pinot.segment.local.utils.VectorDistanceFunction;
+import org.apache.pinot.segment.spi.V1Constants;
+import org.apache.pinot.segment.spi.index.creator.VectorIndexConfig;
+import org.apache.pinot.segment.spi.index.reader.NprobeAware;
+import org.apache.pinot.segment.spi.index.reader.VectorIndexReader;
+import org.roaringbitmap.buffer.MutableRoaringBitmap;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+
+/**
+ * Reader for IVF_FLAT (Inverted File with flat vectors) index.
+ *
+ * <p>Loads the entire index into memory at construction time for fast search.
+ * The search algorithm:
+ * <ol>
+ *   <li>Computes distance from the query to all centroids.</li>
+ *   <li>Selects the {@code nprobe} closest centroids.</li>
+ *   <li>Scans all vectors in those centroids' inverted lists.</li>
+ *   <li>Returns the top-K doc IDs as a bitmap.</li>
+ * </ol>
+ *
+ * <h3>Thread safety</h3>
+ * <p>This class is thread-safe for concurrent reads. The loaded index data is 
immutable
+ * after construction. The only mutable state is {@code _nprobe}, which is 
volatile to
+ * allow query-time tuning from another thread. However, the typical pattern is
+ * single-threaded: set nprobe, then call getDocIds.</p>
+ */
+public class IvfFlatVectorIndexReader implements VectorIndexReader, 
NprobeAware {
+  private static final Logger LOGGER = 
LoggerFactory.getLogger(IvfFlatVectorIndexReader.class);
+
+  /** Default nprobe value when not explicitly set. */
+  static final int DEFAULT_NPROBE = 4;
+
+  // Index data loaded from file
+  private final int _dimension;
+  private final int _numVectors;
+  private final int _nlist;
+  private final VectorIndexConfig.VectorDistanceFunction _distanceFunction;
+  private final float[][] _centroids;
+  private final int[][] _listDocIds;
+  private final float[][][] _listVectors;
+  private final String _column;
+
+  /** Number of centroids to probe during search. */
+  private volatile int _nprobe;
+
+  /**
+   * Opens and loads an IVF_FLAT index from disk.
+   *
+   * @param column    the column name
+   * @param indexDir  the segment index directory
+   * @param config    the vector index configuration
+   * @throws RuntimeException if the index file cannot be read or is corrupt
+   */
+  public IvfFlatVectorIndexReader(String column, File indexDir, 
VectorIndexConfig config) {
+    _column = column;
+
+    // Determine nprobe from config
+    int configuredNprobe = DEFAULT_NPROBE;
+    if (config.getProperties() != null && 
config.getProperties().containsKey("nprobe")) {
+      configuredNprobe = 
Integer.parseInt(config.getProperties().get("nprobe"));
+    }
+

Review Comment:
   The reader initializes nprobe from config properties key "nprobe", but the 
PR introduces query-time tuning via the query option key "vectorNprobe" (and 
the IVF_FLAT config validator doesn’t list/validate an index-time "nprobe" 
property). This is likely to confuse users and create unvalidated config 
behavior; consider removing config-based nprobe entirely (default to 
DEFAULT_NPROBE) and relying on NprobeAware#setNprobe for query-time tuning, or 
formally add/validate an index-time default nprobe property with consistent 
naming.
   ```suggestion
       // Initialize nprobe to the default; query-time tuning should use 
NprobeAware#setNprobe.
       int configuredNprobe = DEFAULT_NPROBE;
   ```



##########
pinot-core/src/main/java/org/apache/pinot/core/operator/filter/ExactVectorScanFilterOperator.java:
##########
@@ -0,0 +1,227 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.core.operator.filter;
+
+import com.google.common.base.CaseFormat;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import java.util.PriorityQueue;
+import 
org.apache.pinot.common.request.context.predicate.VectorSimilarityPredicate;
+import org.apache.pinot.core.common.BlockDocIdSet;
+import org.apache.pinot.core.common.Operator;
+import org.apache.pinot.core.operator.ExplainAttributeBuilder;
+import org.apache.pinot.core.operator.docidsets.BitmapDocIdSet;
+import org.apache.pinot.segment.spi.index.reader.ForwardIndexReader;
+import org.apache.pinot.segment.spi.index.reader.ForwardIndexReaderContext;
+import org.apache.pinot.spi.data.FieldSpec;
+import org.apache.pinot.spi.trace.FilterType;
+import org.apache.pinot.spi.trace.InvocationRecording;
+import org.apache.pinot.spi.trace.Tracing;
+import org.roaringbitmap.buffer.ImmutableRoaringBitmap;
+import org.roaringbitmap.buffer.MutableRoaringBitmap;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+
+/**
+ * Fallback operator that performs exact brute-force vector similarity search 
by scanning the forward index.
+ *
+ * <p>This operator is used when no ANN vector index exists on a segment for 
the target column
+ * (e.g., the segment was built before the vector index was added, or the 
index type is not
+ * supported). It reads all vectors from the forward index, computes exact 
distances to the
+ * query vector, and returns the top-K closest document IDs.</p>
+ *
+ * <p>The distance computation uses L2 (Euclidean) squared distance. For 
COSINE similarity,
+ * vectors should be pre-normalized. This matches the behavior of Lucene's 
HNSW implementation.</p>
+ *
+ * <p>This operator is intentionally simple and correct rather than fast -- it 
is a safety net.
+ * A warning is logged when this operator is used because it scans all 
documents in the segment.</p>
+ *
+ * <p>This class is thread-safe for single-threaded execution per query (same 
as other filter operators).</p>
+ */
+public class ExactVectorScanFilterOperator extends BaseFilterOperator {
+  private static final Logger LOGGER = 
LoggerFactory.getLogger(ExactVectorScanFilterOperator.class);
+  private static final String EXPLAIN_NAME = "VECTOR_SIMILARITY_EXACT_SCAN";
+
+  private final ForwardIndexReader<?> _forwardIndexReader;
+  private final VectorSimilarityPredicate _predicate;
+  private final String _column;
+  private ImmutableRoaringBitmap _matches;
+
+  /**
+   * Creates an exact scan operator.
+   *
+   * @param forwardIndexReader the forward index reader for the vector column
+   * @param predicate the vector similarity predicate containing query vector 
and top-K
+   * @param column the column name (for logging and explain)
+   * @param numDocs the total number of documents in the segment
+   */
+  public ExactVectorScanFilterOperator(ForwardIndexReader<?> 
forwardIndexReader,
+      VectorSimilarityPredicate predicate, String column, int numDocs) {
+    super(numDocs, false);
+    _forwardIndexReader = forwardIndexReader;
+    _predicate = predicate;
+    _column = column;
+  }
+
+  @Override
+  protected BlockDocIdSet getTrues() {
+    if (_matches == null) {
+      _matches = computeExactTopK();
+    }
+    return new BitmapDocIdSet(_matches, _numDocs);
+  }
+
+  @Override
+  public int getNumMatchingDocs() {
+    if (_matches == null) {
+      _matches = computeExactTopK();
+    }
+    return _matches.getCardinality();
+  }
+
+  @Override
+  public boolean canProduceBitmaps() {
+    return true;
+  }
+
+  @Override
+  public BitmapCollection getBitmaps() {
+    if (_matches == null) {
+      _matches = computeExactTopK();
+    }
+    record(_matches);
+    return new BitmapCollection(_numDocs, false, _matches);
+  }
+
+  @Override
+  public List<Operator> getChildOperators() {
+    return Collections.emptyList();
+  }
+
+  @Override
+  public String toExplainString() {
+    return EXPLAIN_NAME + "(indexLookUp:exact_scan"
+        + ", operator:" + _predicate.getType()
+        + ", vector identifier:" + _column
+        + ", vector literal:" + Arrays.toString(_predicate.getValue())
+        + ", topK to search:" + _predicate.getTopK()
+        + ')';
+  }
+
+  @Override
+  protected String getExplainName() {
+    return CaseFormat.UPPER_UNDERSCORE.to(CaseFormat.UPPER_CAMEL, 
EXPLAIN_NAME);
+  }
+
+  @Override
+  protected void explainAttributes(ExplainAttributeBuilder attributeBuilder) {
+    super.explainAttributes(attributeBuilder);
+    attributeBuilder.putString("indexLookUp", "exact_scan");
+    attributeBuilder.putString("operator", _predicate.getType().name());
+    attributeBuilder.putString("vectorIdentifier", _column);
+    attributeBuilder.putString("vectorLiteral", 
Arrays.toString(_predicate.getValue()));
+    attributeBuilder.putLongIdempotent("topKtoSearch", _predicate.getTopK());
+  }
+
+  /**
+   * Performs brute-force exact search over all documents in the segment.
+   * Uses a max-heap to maintain the top-K closest vectors.
+   */
+  @SuppressWarnings("unchecked")
+  private ImmutableRoaringBitmap computeExactTopK() {
+    LOGGER.warn("Performing exact vector scan fallback on column: {} for 
segment with {} docs. "
+        + "This is expensive -- consider adding a vector index.", _column, 
_numDocs);
+
+    float[] queryVector = _predicate.getValue();
+    int topK = _predicate.getTopK();
+
+    // Max-heap: entry with largest distance is at the top so we can 
efficiently evict it
+    PriorityQueue<DocDistance> maxHeap = new PriorityQueue<>(topK + 1,
+        (a, b) -> Float.compare(b._distance, a._distance));
+
+    ForwardIndexReader rawReader = _forwardIndexReader;
+    try (ForwardIndexReaderContext context = rawReader.createContext()) {
+      for (int docId = 0; docId < _numDocs; docId++) {
+        float[] docVector = rawReader.getFloatMV(docId, context);
+        if (docVector == null || docVector.length == 0) {
+          continue;
+        }
+        float distance = computeL2SquaredDistance(queryVector, docVector);
+        if (maxHeap.size() < topK) {
+          maxHeap.add(new DocDistance(docId, distance));
+        } else if (distance < maxHeap.peek()._distance) {
+          maxHeap.poll();
+          maxHeap.add(new DocDistance(docId, distance));
+        }
+      }
+    } catch (Exception e) {
+      throw new RuntimeException("Error during exact vector scan on column: " 
+ _column, e);
+    }
+
+    MutableRoaringBitmap result = new MutableRoaringBitmap();
+    for (DocDistance dd : maxHeap) {
+      result.add(dd._docId);
+    }
+
+    LOGGER.debug("Exact vector scan on column: {} returned {} results from {} 
docs",
+        _column, result.getCardinality(), _numDocs);
+
+    return result;
+  }
+
+  /**
+   * Computes the squared L2 (Euclidean) distance between two vectors.
+   * Uses squared distance to avoid the sqrt, which is monotonic so does not 
affect ranking.
+   */
+  static float computeL2SquaredDistance(float[] a, float[] b) {
+    int len = Math.min(a.length, b.length);

Review Comment:
   computeL2SquaredDistance() uses Math.min(a.length, b.length) and silently 
ignores extra dimensions when lengths differ. This can hide data issues and 
produce wrong rankings; it should validate that dimensions match (or fail fast) 
since vectorSimilarity semantics assume a fixed dimension per column.
   ```suggestion
       if (a.length != b.length) {
         throw new IllegalArgumentException(
             "Vector length mismatch when computing L2 distance: query length=" 
+ a.length
                 + ", document length=" + b.length);
       }
       int len = a.length;
   ```



##########
pinot-core/src/main/java/org/apache/pinot/core/operator/filter/VectorSimilarityFilterOperator.java:
##########
@@ -120,6 +157,104 @@ protected void explainAttributes(ExplainAttributeBuilder 
attributeBuilder) {
     attributeBuilder.putString("vectorIdentifier", 
_predicate.getLhs().getIdentifier());
     attributeBuilder.putString("vectorLiteral", 
Arrays.toString(_predicate.getValue()));
     attributeBuilder.putLongIdempotent("topKtoSearch", _predicate.getTopK());
+    if (_searchParams.isExactRerank()) {
+      attributeBuilder.putString("exactRerank", "true");
+    }
+  }
+
+  /**
+   * Executes the vector search with backend-specific parameter dispatch and 
optional rerank.
+   */
+  private ImmutableRoaringBitmap executeSearch() {
+    String column = _predicate.getLhs().getIdentifier();
+    float[] queryVector = _predicate.getValue();
+    int topK = _predicate.getTopK();
+
+    // 1. Configure backend-specific parameters via interfaces
+    configureBackendParams(column);
+
+    // 2. Determine effective search count (higher if rerank is enabled)
+    int searchCount = topK;
+    if (_searchParams.isExactRerank()) {
+      searchCount = _searchParams.getEffectiveMaxCandidates(topK);
+    }
+
+    // 3. Execute ANN search
+    ImmutableRoaringBitmap annResults = 
_vectorIndexReader.getDocIds(queryVector, searchCount);
+    int annCandidateCount = annResults.getCardinality();
+
+    LOGGER.debug("Vector search on column: {}, backend: {}, topK: {}, 
searchCount: {}, annCandidates: {}",
+        column, getBackendName(), topK, searchCount, annCandidateCount);
+
+    // 4. Apply exact rerank if requested
+    if (_searchParams.isExactRerank() && _forwardIndexReader != null && 
annCandidateCount > 0) {
+      ImmutableRoaringBitmap reranked = applyExactRerank(annResults, 
queryVector, topK, column);
+      LOGGER.debug("Exact rerank on column: {}, candidates: {} -> final: {}",
+          column, annCandidateCount, reranked.getCardinality());
+      return reranked;
+    }
+
+    return annResults;
+  }
+
+  /**
+   * Configures backend-specific search parameters on the reader if it 
supports them.
+   */
+  private void configureBackendParams(String column) {
+    // Set nprobe on IVF_FLAT readers
+    if (_vectorIndexReader instanceof NprobeAware) {
+      int nprobe = _searchParams.getNprobe();
+      ((NprobeAware) _vectorIndexReader).setNprobe(nprobe);
+      LOGGER.debug("Set nprobe={} on IVF_FLAT reader for column: {}", nprobe, 
column);
+    }
+  }
+
+  /**
+   * Re-scores ANN candidates using exact distance from the forward index and 
returns top-K.
+   */
+  @SuppressWarnings("unchecked")
+  private ImmutableRoaringBitmap applyExactRerank(ImmutableRoaringBitmap 
annResults, float[] queryVector,
+      int topK, String column) {
+    // Max-heap: largest distance on top for efficient eviction
+    PriorityQueue<DocDistance> maxHeap = new PriorityQueue<>(topK + 1,
+        (a, b) -> Float.compare(b._distance, a._distance));
+
+    ForwardIndexReader rawReader = _forwardIndexReader;
+    try (ForwardIndexReaderContext context = rawReader.createContext()) {
+      org.roaringbitmap.IntIterator it = annResults.getIntIterator();
+      while (it.hasNext()) {
+        int docId = it.next();
+        float[] docVector = rawReader.getFloatMV(docId, context);
+        if (docVector == null || docVector.length == 0) {
+          continue;
+        }
+        float distance = 
ExactVectorScanFilterOperator.computeL2SquaredDistance(queryVector, docVector);
+        if (maxHeap.size() < topK) {
+          maxHeap.add(new DocDistance(docId, distance));

Review Comment:
   Exact rerank re-scores ANN candidates using L2 squared distance 
unconditionally. This will return incorrect top-K ordering for COSINE / 
INNER_PRODUCT / DOT_PRODUCT configured segments (and COSINE unless vectors are 
guaranteed normalized). Consider deriving the distance function from the 
segment’s vector index config (or exposing it via VectorIndexReader) and 
re-scoring with the same metric used by the ANN backend.



##########
pinot-core/src/main/java/org/apache/pinot/core/operator/filter/ExactVectorScanFilterOperator.java:
##########
@@ -0,0 +1,227 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.core.operator.filter;
+
+import com.google.common.base.CaseFormat;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import java.util.PriorityQueue;
+import 
org.apache.pinot.common.request.context.predicate.VectorSimilarityPredicate;
+import org.apache.pinot.core.common.BlockDocIdSet;
+import org.apache.pinot.core.common.Operator;
+import org.apache.pinot.core.operator.ExplainAttributeBuilder;
+import org.apache.pinot.core.operator.docidsets.BitmapDocIdSet;
+import org.apache.pinot.segment.spi.index.reader.ForwardIndexReader;
+import org.apache.pinot.segment.spi.index.reader.ForwardIndexReaderContext;
+import org.apache.pinot.spi.data.FieldSpec;
+import org.apache.pinot.spi.trace.FilterType;
+import org.apache.pinot.spi.trace.InvocationRecording;
+import org.apache.pinot.spi.trace.Tracing;
+import org.roaringbitmap.buffer.ImmutableRoaringBitmap;
+import org.roaringbitmap.buffer.MutableRoaringBitmap;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+
+/**
+ * Fallback operator that performs exact brute-force vector similarity search 
by scanning the forward index.
+ *
+ * <p>This operator is used when no ANN vector index exists on a segment for 
the target column
+ * (e.g., the segment was built before the vector index was added, or the 
index type is not
+ * supported). It reads all vectors from the forward index, computes exact 
distances to the
+ * query vector, and returns the top-K closest document IDs.</p>
+ *
+ * <p>The distance computation uses L2 (Euclidean) squared distance. For 
COSINE similarity,
+ * vectors should be pre-normalized. This matches the behavior of Lucene's 
HNSW implementation.</p>
+ *
+ * <p>This operator is intentionally simple and correct rather than fast -- it 
is a safety net.
+ * A warning is logged when this operator is used because it scans all 
documents in the segment.</p>
+ *
+ * <p>This class is thread-safe for single-threaded execution per query (same 
as other filter operators).</p>
+ */
+public class ExactVectorScanFilterOperator extends BaseFilterOperator {
+  private static final Logger LOGGER = 
LoggerFactory.getLogger(ExactVectorScanFilterOperator.class);
+  private static final String EXPLAIN_NAME = "VECTOR_SIMILARITY_EXACT_SCAN";
+
+  private final ForwardIndexReader<?> _forwardIndexReader;
+  private final VectorSimilarityPredicate _predicate;
+  private final String _column;
+  private ImmutableRoaringBitmap _matches;
+
+  /**
+   * Creates an exact scan operator.
+   *
+   * @param forwardIndexReader the forward index reader for the vector column
+   * @param predicate the vector similarity predicate containing query vector 
and top-K
+   * @param column the column name (for logging and explain)
+   * @param numDocs the total number of documents in the segment
+   */
+  public ExactVectorScanFilterOperator(ForwardIndexReader<?> 
forwardIndexReader,
+      VectorSimilarityPredicate predicate, String column, int numDocs) {
+    super(numDocs, false);
+    _forwardIndexReader = forwardIndexReader;
+    _predicate = predicate;
+    _column = column;
+  }
+
+  @Override
+  protected BlockDocIdSet getTrues() {
+    if (_matches == null) {
+      _matches = computeExactTopK();
+    }
+    return new BitmapDocIdSet(_matches, _numDocs);
+  }
+
+  @Override
+  public int getNumMatchingDocs() {
+    if (_matches == null) {
+      _matches = computeExactTopK();
+    }
+    return _matches.getCardinality();
+  }
+
+  @Override
+  public boolean canProduceBitmaps() {
+    return true;
+  }
+
+  @Override
+  public BitmapCollection getBitmaps() {
+    if (_matches == null) {
+      _matches = computeExactTopK();
+    }
+    record(_matches);
+    return new BitmapCollection(_numDocs, false, _matches);
+  }
+
+  @Override
+  public List<Operator> getChildOperators() {
+    return Collections.emptyList();
+  }
+
+  @Override
+  public String toExplainString() {
+    return EXPLAIN_NAME + "(indexLookUp:exact_scan"
+        + ", operator:" + _predicate.getType()
+        + ", vector identifier:" + _column
+        + ", vector literal:" + Arrays.toString(_predicate.getValue())
+        + ", topK to search:" + _predicate.getTopK()
+        + ')';
+  }
+
+  @Override
+  protected String getExplainName() {
+    return CaseFormat.UPPER_UNDERSCORE.to(CaseFormat.UPPER_CAMEL, 
EXPLAIN_NAME);
+  }
+
+  @Override
+  protected void explainAttributes(ExplainAttributeBuilder attributeBuilder) {
+    super.explainAttributes(attributeBuilder);
+    attributeBuilder.putString("indexLookUp", "exact_scan");
+    attributeBuilder.putString("operator", _predicate.getType().name());
+    attributeBuilder.putString("vectorIdentifier", _column);
+    attributeBuilder.putString("vectorLiteral", 
Arrays.toString(_predicate.getValue()));
+    attributeBuilder.putLongIdempotent("topKtoSearch", _predicate.getTopK());
+  }
+
+  /**
+   * Performs brute-force exact search over all documents in the segment.
+   * Uses a max-heap to maintain the top-K closest vectors.
+   */
+  @SuppressWarnings("unchecked")
+  private ImmutableRoaringBitmap computeExactTopK() {
+    LOGGER.warn("Performing exact vector scan fallback on column: {} for 
segment with {} docs. "
+        + "This is expensive -- consider adding a vector index.", _column, 
_numDocs);
+
+    float[] queryVector = _predicate.getValue();
+    int topK = _predicate.getTopK();
+
+    // Max-heap: entry with largest distance is at the top so we can 
efficiently evict it
+    PriorityQueue<DocDistance> maxHeap = new PriorityQueue<>(topK + 1,
+        (a, b) -> Float.compare(b._distance, a._distance));
+
+    ForwardIndexReader rawReader = _forwardIndexReader;
+    try (ForwardIndexReaderContext context = rawReader.createContext()) {
+      for (int docId = 0; docId < _numDocs; docId++) {
+        float[] docVector = rawReader.getFloatMV(docId, context);
+        if (docVector == null || docVector.length == 0) {
+          continue;
+        }
+        float distance = computeL2SquaredDistance(queryVector, docVector);

Review Comment:
   Exact scan fallback always computes L2 squared distance (see the 
computeL2SquaredDistance call below), regardless of the column’s configured 
vector distance function. For non-EUCLIDEAN backends/configs, this can return 
incorrect results. Either restrict this fallback to EUCLIDEAN/L2 segments (and 
error otherwise), or pass/derive the configured distance function and compute 
exact distance accordingly.



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/VectorDistanceFunction.java:
##########
@@ -0,0 +1,212 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.utils;
+
+import org.apache.pinot.segment.spi.index.creator.VectorIndexConfig;
+
+
+/**
+ * Pure-Java implementations of vector distance functions used by the IVF_FLAT 
index.
+ *
+ * <p>All distance functions return a non-negative value where lower means 
more similar.
+ * This convention allows a single min-heap to be used for top-K selection 
regardless
+ * of the distance metric.</p>

Review Comment:
   The class-level Javadoc says “All distance functions return a non-negative 
value”, but dot/inner-product distance is defined as -dot(a,b) and can be 
negative. Please adjust the wording to match the implementation (e.g., “lower 
means more similar; values may be negative for dot/inner product”).
   ```suggestion
    * <p>Distance functions are defined so that lower values mean more similar; 
values may be
    * negative for inner-product / dot-product distance. This convention allows 
a single
    * min-heap to be used for top-K selection regardless of the distance 
metric.</p>
   ```



##########
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/creator/VectorIndexConfigValidator.java:
##########
@@ -0,0 +1,211 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.spi.index.creator;
+
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.Map;
+import java.util.Set;
+
+
+/**
+ * Validates {@link VectorIndexConfig} for backend-specific correctness.
+ *
+ * <p>This validator ensures that:
+ * <ul>
+ *   <li>Required common fields (vectorDimension, vectorDistanceFunction) are 
present and valid.</li>
+ *   <li>The vectorIndexType resolves to a known {@link 
VectorBackendType}.</li>
+ *   <li>Backend-specific properties are valid for the resolved backend 
type.</li>
+ *   <li>Properties belonging to a different backend are rejected with a clear 
error message.</li>
+ * </ul>
+ *
+ * <p>Thread-safe: this class is stateless and all methods are static.</p>
+ */
+public final class VectorIndexConfigValidator {
+
+  // HNSW-specific property keys
+  static final Set<String> HNSW_PROPERTIES = Collections.unmodifiableSet(new 
HashSet<>(
+      Arrays.asList("maxCon", "beamWidth", "maxDimensions", "maxBufferSizeMB",
+          "useCompoundFile", "mode", "commit", "commitIntervalMs", 
"commitDocs")));
+
+  // IVF_FLAT-specific property keys
+  static final Set<String> IVF_FLAT_PROPERTIES = 
Collections.unmodifiableSet(new HashSet<>(
+      Arrays.asList("nlist", "trainSampleSize", "trainingSeed", 
"minRowsForIndex")));
+
+  // Common property keys that appear in the properties map (legacy format 
stores common fields there too)
+  private static final Set<String> COMMON_PROPERTIES = 
Collections.unmodifiableSet(new HashSet<>(
+      Arrays.asList("vectorIndexType", "vectorDimension", 
"vectorDistanceFunction", "version")));
+
+  private VectorIndexConfigValidator() {
+  }
+
+  /**
+   * Validates the given {@link VectorIndexConfig} for backend-specific 
correctness.
+   *
+   * @param config the config to validate
+   * @throws IllegalArgumentException if validation fails
+   */
+  public static void validate(VectorIndexConfig config) {
+    if (config.isDisabled()) {
+      return;
+    }
+
+    VectorBackendType backendType = resolveBackendType(config);
+    validateCommonFields(config);
+    validateBackendSpecificProperties(config, backendType);
+  }
+
+  /**
+   * Resolves the {@link VectorBackendType} from the config. Defaults to HNSW 
if the
+   * vectorIndexType field is null or empty, preserving backward compatibility.
+   *
+   * @param config the config to resolve from
+   * @return the resolved backend type
+   * @throws IllegalArgumentException if the vectorIndexType is not recognized
+   */
+  public static VectorBackendType resolveBackendType(VectorIndexConfig config) 
{
+    String typeString = config.getVectorIndexType();
+    if (typeString == null || typeString.isEmpty()) {
+      return VectorBackendType.HNSW;
+    }
+    return VectorBackendType.fromString(typeString);
+  }
+
+  /**
+   * Validates common fields shared across all backend types.
+   */
+  private static void validateCommonFields(VectorIndexConfig config) {
+    if (config.getVectorDimension() <= 0) {
+      throw new IllegalArgumentException(
+          "vectorDimension must be a positive integer, got: " + 
config.getVectorDimension());
+    }
+
+    if (config.getVectorDistanceFunction() == null) {
+      throw new IllegalArgumentException("vectorDistanceFunction is required");
+    }
+  }
+
+  /**
+   * Validates that the properties map only contains keys valid for the 
resolved backend type,
+   * and that backend-specific property values are within acceptable ranges.
+   */
+  private static void validateBackendSpecificProperties(VectorIndexConfig 
config, VectorBackendType backendType) {
+    Map<String, String> properties = config.getProperties();
+    if (properties == null || properties.isEmpty()) {
+      return;
+    }
+
+    switch (backendType) {
+      case HNSW:
+        validateNoForeignProperties(properties, HNSW_PROPERTIES, 
IVF_FLAT_PROPERTIES, "HNSW", "IVF_FLAT");
+        validateHnswProperties(properties);
+        break;
+      case IVF_FLAT:
+        validateNoForeignProperties(properties, IVF_FLAT_PROPERTIES, 
HNSW_PROPERTIES, "IVF_FLAT", "HNSW");
+        validateIvfFlatProperties(properties);
+        break;
+      default:
+        throw new IllegalArgumentException("Unsupported vector backend type: " 
+ backendType);
+    }
+  }
+
+  /**
+   * Ensures that properties belonging to a foreign backend are not present.
+   */
+  private static void validateNoForeignProperties(Map<String, String> 
properties,
+      Set<String> ownProperties, Set<String> foreignProperties,
+      String ownType, String foreignType) {
+    for (String key : properties.keySet()) {
+      if (COMMON_PROPERTIES.contains(key)) {

Review Comment:
   VectorIndexConfigValidator Javadoc claims the properties map “only contains 
keys valid for the resolved backend type”, but validateNoForeignProperties() 
only rejects keys from the other backend and allows arbitrary unknown keys. 
Also, the ownProperties parameter is currently unused. Either enforce an 
allowlist (ownProperties ∪ COMMON_PROPERTIES) to catch typos/misconfigurations, 
or update the Javadoc/signature to reflect the intended behavior.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to