Re: [PR] Consolidate IVF and HNSW vector indexes into segment columns.psf (opt-in via storeInSegmentFile) [pinot]

via GitHub Sat, 04 Jul 2026 12:35:41 -0700


xiangfu0 commented on code in PR #18852:
URL: https://github.com/apache/pinot/pull/18852#discussion_r3523751104



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/loader/invertedindex/VectorIndexHandler.java:
##########
@@ -78,6 +85,31 @@ public boolean needUpdateIndices(SegmentDirectory.Reader 
segmentReader) {
             segmentName, column, existingBackend, desiredBackend);
         return true;
       }
+      // Migration: sidecar -> columns.psf. Fires either when the offline 
writer just wrote a
+      // sidecar (build with flag on) or when the operator just flipped the 
flag from off to on
+      // for an existing legacy segment. Same handler step covers both.
+      // For HNSW the "sidecar" is the combined packed file 
(.vector.hnsw.combined.index);
+      // for IVF it is the legacy combined extension.
+      if (desiredConfig.isStoreInSegmentFile()
+          && _segmentDirectory.getSegmentMetadata().getVersion() == 
SegmentVersion.v3
+          && existingBackend != null
+          && hasCombinedFile(indexDir, column, desiredBackend)) {
+        LOGGER.info("Need to consolidate Vector sidecar file into columns.psf 
for segment: {}, column: {}",
+            segmentName, column);
+        return true;
+      }
+      // Migration: columns.psf -> sidecar. Fires when the operator flipped 
the flag from on to
+      // off for a segment whose vector payload was previously absorbed. We 
detect this from the
+      // SegmentDirectory: the column has a vector index ({@code 
existingColumns} contains it via
+      // {@code SingleFileIndexDirectory._columnEntries}) but no sidecar file 
is on disk.
+      if (!desiredConfig.isStoreInSegmentFile()
+          && _segmentDirectory.getSegmentMetadata().getVersion() == 
SegmentVersion.v3
+          && existingBackend == null

Review Comment:
   Fixed in 277ff1d02a (helpers in d8396fdcb5). `needUpdateIndices` and 
`updateIndices` now share `getDriftedConsolidatedBackend`, which sniffs the 
consolidated payload's leading magic (`LUCENE_V2` / `IVFF` / `IVPQ`, via the 
new `VectorIndexUtils.detectConsolidatedVectorBackend`) when `existingBackend 
== null` on a V3 segment. On a mismatch the column is rebuilt (`removeIndex` + 
re-add) instead of running the absorb/extract migrations — so a consolidated 
HNSW payload with the config flipped to IVF_PQ can no longer be extracted under 
`.vector.ivfpq.index` and parsed by the wrong reader.
   
   One subtlety the fix had to handle: `IVF_ON_DISK` shares the IVF_FLAT file 
format, so its payloads sniff as `IVF_FLAT`. All four drift comparisons 
(sidecar + consolidated, check + act) therefore compare via `storageFormatOf` 
(IVF_ON_DISK → IVF_FLAT) — a raw enum comparison would have flagged a healthy 
IVF_ON_DISK column as drifted and destroyed/rebuilt it on every segment load, 
forever. This also fixes the same latent loop in the pre-existing sidecar drift 
check. Unknown-magic payloads deliberately do NOT count as drift (extract 
preserves the bytes; rebuilding would destroy an 
unrecognized-but-possibly-valid payload).
   
   Regression tests: 
`testNeedUpdateIndicesReturnsTrueWhenConsolidatedBackendDiffers`, 
`testUpdateIndicesRebuildsInsteadOfExtractingWhenConsolidatedBackendDiffers` 
(asserts no `.vector.ivfpq.index` is written and the column reaches the rebuild 
loop), `testUpdateIndicesExtractsWhenConsolidatedBackendMatches`, plus 
IVF_ON_DISK no-drift tests for both consolidated and sidecar layouts, and 
magic-sniff unit tests.
   



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/vector/VectorIndexType.java:
##########
@@ -210,25 +211,85 @@ public VectorIndexReader 
createIndexReader(SegmentDirectory.Reader segmentReader
         return null;
       }
       VectorBackendType backendType = indexConfig.resolveBackendType();
-      File configuredIndexFile =
-          SegmentDirectoryPaths.findVectorIndexIndexFile(segmentDir, 
metadata.getColumnName(), indexConfig);
-      if (configuredIndexFile == null || !configuredIndexFile.exists()) {
-        LOGGER.warn("Skipping vector index reader for column: {} because 
configured backend {} does not have a "
-                + "matching on-disk artifact in segment: {}",
-            metadata.getColumnName(), backendType, segmentDir);
-        return null;
+      String column = metadata.getColumnName();
+
+      if (backendType == VectorBackendType.HNSW) {
+        // Combined form: load the HNSW index from the typed entry inside 
columns.psf when one
+        // actually exists. getConsolidatedVectorEntry (unlike hasIndexFor) 
does not mistake the
+        // legacy Lucene directory for a packed entry, so a segment whose 
handler has not yet
+        // migrated falls through to the legacy path below instead of failing 
the whole load. The
+        // buffer is owned by the segment directory — this reader must not 
close it.
+        if (indexConfig.isStoreInSegmentFile()) {
+          PinotDataBuffer buffer;
+          try {
+            buffer = 
VectorIndexUtils.getConsolidatedVectorEntry(segmentReader, column);

Review Comment:
   Fixed in 4cef636d3a (utility hardening in d8396fdcb5), two layers:
   
   1. **Reader factory — legacy-directory-first**: with 
`storeInSegmentFile=true`, the HNSW branch now checks the on-disk artifact 
before probing columns.psf. If `findVectorIndexIndexFile(HNSW)` resolves a 
Lucene *directory* (unmigrated V3, or any V1/V2 `FilePerIndexDirectory` 
segment), it reads the directory directly and never calls `getIndexFor` — so 
the `mapForReads` "must be a regular file" rejection can no longer kill the 
load. Tests cover both the V1 root layout and the `v3/` subdirectory layout, 
asserting via `verify(never())` that the probe does not run.
   
   2. **Utility hardening**: `getConsolidatedVectorEntry` itself now maps 
`FilePerIndexDirectory`'s directory-resolution `IllegalArgumentException` 
(matched via a shared message-suffix constant) to `null` — a directory can 
never be a packed typed entry — so any future ungated caller is protected too, 
and its contract Javadoc now states the per-store behavior honestly (on V1/V2, 
an IVF sidecar *file* is returned as-is; callers needing "truly consolidated" 
must gate on version or on-disk artifacts).
   
   Note: `TextIndexType.ReaderFactory` (which this vector machinery was modeled 
on) has the identical unguarded probe on master for legacy `.lucene.index` 
directories — filing that as a separate follow-up rather than expanding this PR.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Consolidate IVF and HNSW vector indexes into segment columns.psf (opt-in via storeInSegmentFile) [pinot]

Reply via email to