xiangfu0 commented on code in PR #18852:
URL: https://github.com/apache/pinot/pull/18852#discussion_r3498628717
##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/vector/VectorIndexType.java:
##########
@@ -210,25 +211,85 @@ public VectorIndexReader
createIndexReader(SegmentDirectory.Reader segmentReader
return null;
}
VectorBackendType backendType = indexConfig.resolveBackendType();
- File configuredIndexFile =
- SegmentDirectoryPaths.findVectorIndexIndexFile(segmentDir,
metadata.getColumnName(), indexConfig);
- if (configuredIndexFile == null || !configuredIndexFile.exists()) {
- LOGGER.warn("Skipping vector index reader for column: {} because
configured backend {} does not have a "
- + "matching on-disk artifact in segment: {}",
- metadata.getColumnName(), backendType, segmentDir);
- return null;
+ String column = metadata.getColumnName();
+
+ if (backendType == VectorBackendType.HNSW) {
+ // Combined form: load the HNSW index from the typed entry inside
columns.psf when one
+ // actually exists. getConsolidatedVectorEntry (unlike hasIndexFor)
does not mistake the
+ // legacy Lucene directory for a packed entry, so a segment whose
handler has not yet
+ // migrated falls through to the legacy path below instead of failing
the whole load. The
+ // buffer is owned by the segment directory — this reader must not
close it.
+ if (indexConfig.isStoreInSegmentFile()) {
+ PinotDataBuffer buffer;
+ try {
+ buffer =
VectorIndexUtils.getConsolidatedVectorEntry(segmentReader, column);
Review Comment:
This consolidated-entry probe is not safe on `FilePerIndexDirectory`
segments that still have the legacy HNSW directory on disk. In that case
`getIndexFor(StandardIndexes.vector())` resolves the existing directory
candidate, `mapForReads()` rejects it because it is not a regular file, and the
method throws before the fallback at lines 236-237 can run. That makes
`storeInSegmentFile=true` non-rolling-upgrade-safe for existing V1/V2 HNSW
segments. Please gate this probe to the packed-file/V3 case or catch the
directory case and fall back to the legacy reader.
##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/loader/invertedindex/VectorIndexHandler.java:
##########
@@ -78,6 +85,31 @@ public boolean needUpdateIndices(SegmentDirectory.Reader
segmentReader) {
segmentName, column, existingBackend, desiredBackend);
return true;
}
+ // Migration: sidecar -> columns.psf. Fires either when the offline
writer just wrote a
+ // sidecar (build with flag on) or when the operator just flipped the
flag from off to on
+ // for an existing legacy segment. Same handler step covers both.
+ // For HNSW the "sidecar" is the combined packed file
(.vector.hnsw.combined.index);
+ // for IVF it is the legacy combined extension.
+ if (desiredConfig.isStoreInSegmentFile()
+ && _segmentDirectory.getSegmentMetadata().getVersion() ==
SegmentVersion.v3
+ && existingBackend != null
+ && hasCombinedFile(indexDir, column, desiredBackend)) {
+ LOGGER.info("Need to consolidate Vector sidecar file into columns.psf
for segment: {}, column: {}",
+ segmentName, column);
+ return true;
+ }
+ // Migration: columns.psf -> sidecar. Fires when the operator flipped
the flag from on to
+ // off for a segment whose vector payload was previously absorbed. We
detect this from the
+ // SegmentDirectory: the column has a vector index ({@code
existingColumns} contains it via
+ // {@code SingleFileIndexDirectory._columnEntries}) but no sidecar file
is on disk.
+ if (!desiredConfig.isStoreInSegmentFile()
+ && _segmentDirectory.getSegmentMetadata().getVersion() ==
SegmentVersion.v3
+ && existingBackend == null
Review Comment:
Once `existingColumns` can include a vector index that lives only in
`columns.psf`, `existingBackend == null` no longer means "there is no existing
backend". If a V3 segment already holds, say, a consolidated HNSW payload and
the table config flips to `IVF_PQ`, this branch treats it as a pure
`columns.psf -> sidecar` migration, extracts the old HNSW bytes under the
IVF_PQ extension, and removes the typed entry instead of rebuilding. The next
load then instantiates the new backend reader against the old backend payload.
This backend-drift check needs to infer the existing backend from the
consolidated entry too, not just from sidecar files.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]