Re: [PR] Consolidate IVF and HNSW vector indexes into segment columns.psf (opt-in via storeInSegmentFile) [pinot]

via GitHub Tue, 30 Jun 2026 05:18:25 -0700


xiangfu0 commented on code in PR #18852:
URL: https://github.com/apache/pinot/pull/18852#discussion_r3498628717



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/vector/VectorIndexType.java:
##########
@@ -210,25 +211,85 @@ public VectorIndexReader 
createIndexReader(SegmentDirectory.Reader segmentReader
         return null;
       }
       VectorBackendType backendType = indexConfig.resolveBackendType();
-      File configuredIndexFile =
-          SegmentDirectoryPaths.findVectorIndexIndexFile(segmentDir, 
metadata.getColumnName(), indexConfig);
-      if (configuredIndexFile == null || !configuredIndexFile.exists()) {
-        LOGGER.warn("Skipping vector index reader for column: {} because 
configured backend {} does not have a "
-                + "matching on-disk artifact in segment: {}",
-            metadata.getColumnName(), backendType, segmentDir);
-        return null;
+      String column = metadata.getColumnName();
+
+      if (backendType == VectorBackendType.HNSW) {
+        // Combined form: load the HNSW index from the typed entry inside 
columns.psf when one
+        // actually exists. getConsolidatedVectorEntry (unlike hasIndexFor) 
does not mistake the
+        // legacy Lucene directory for a packed entry, so a segment whose 
handler has not yet
+        // migrated falls through to the legacy path below instead of failing 
the whole load. The
+        // buffer is owned by the segment directory — this reader must not 
close it.
+        if (indexConfig.isStoreInSegmentFile()) {
+          PinotDataBuffer buffer;
+          try {
+            buffer = 
VectorIndexUtils.getConsolidatedVectorEntry(segmentReader, column);

Review Comment:
   This consolidated-entry probe is not safe on `FilePerIndexDirectory` 
segments that still have the legacy HNSW directory on disk. In that case 
`getIndexFor(StandardIndexes.vector())` resolves the existing directory 
candidate, `mapForReads()` rejects it because it is not a regular file, and the 
method throws before the fallback at lines 236-237 can run. That makes 
`storeInSegmentFile=true` non-rolling-upgrade-safe for existing V1/V2 HNSW 
segments. Please gate this probe to the packed-file/V3 case or catch the 
directory case and fall back to the legacy reader.



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/loader/invertedindex/VectorIndexHandler.java:
##########
@@ -78,6 +85,31 @@ public boolean needUpdateIndices(SegmentDirectory.Reader 
segmentReader) {
             segmentName, column, existingBackend, desiredBackend);
         return true;
       }
+      // Migration: sidecar -> columns.psf. Fires either when the offline 
writer just wrote a
+      // sidecar (build with flag on) or when the operator just flipped the 
flag from off to on
+      // for an existing legacy segment. Same handler step covers both.
+      // For HNSW the "sidecar" is the combined packed file 
(.vector.hnsw.combined.index);
+      // for IVF it is the legacy combined extension.
+      if (desiredConfig.isStoreInSegmentFile()
+          && _segmentDirectory.getSegmentMetadata().getVersion() == 
SegmentVersion.v3
+          && existingBackend != null
+          && hasCombinedFile(indexDir, column, desiredBackend)) {
+        LOGGER.info("Need to consolidate Vector sidecar file into columns.psf 
for segment: {}, column: {}",
+            segmentName, column);
+        return true;
+      }
+      // Migration: columns.psf -> sidecar. Fires when the operator flipped 
the flag from on to
+      // off for a segment whose vector payload was previously absorbed. We 
detect this from the
+      // SegmentDirectory: the column has a vector index ({@code 
existingColumns} contains it via
+      // {@code SingleFileIndexDirectory._columnEntries}) but no sidecar file 
is on disk.
+      if (!desiredConfig.isStoreInSegmentFile()
+          && _segmentDirectory.getSegmentMetadata().getVersion() == 
SegmentVersion.v3
+          && existingBackend == null

Review Comment:
   Once `existingColumns` can include a vector index that lives only in 
`columns.psf`, `existingBackend == null` no longer means "there is no existing 
backend". If a V3 segment already holds, say, a consolidated HNSW payload and 
the table config flips to `IVF_PQ`, this branch treats it as a pure 
`columns.psf -> sidecar` migration, extracts the old HNSW bytes under the 
IVF_PQ extension, and removes the typed entry instead of rebuilding. The next 
load then instantiates the new backend reader against the old backend payload. 
This backend-drift check needs to infer the existing backend from the 
consolidated entry too, not just from sidecar files.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Consolidate IVF and HNSW vector indexes into segment columns.psf (opt-in via storeInSegmentFile) [pinot]

Reply via email to