xiangfu0 commented on code in PR #18852:
URL: https://github.com/apache/pinot/pull/18852#discussion_r3523751104
##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/loader/invertedindex/VectorIndexHandler.java:
##########
@@ -78,6 +85,31 @@ public boolean needUpdateIndices(SegmentDirectory.Reader
segmentReader) {
segmentName, column, existingBackend, desiredBackend);
return true;
}
+ // Migration: sidecar -> columns.psf. Fires either when the offline
writer just wrote a
+ // sidecar (build with flag on) or when the operator just flipped the
flag from off to on
+ // for an existing legacy segment. Same handler step covers both.
+ // For HNSW the "sidecar" is the combined packed file
(.vector.hnsw.combined.index);
+ // for IVF it is the legacy combined extension.
+ if (desiredConfig.isStoreInSegmentFile()
+ && _segmentDirectory.getSegmentMetadata().getVersion() ==
SegmentVersion.v3
+ && existingBackend != null
+ && hasCombinedFile(indexDir, column, desiredBackend)) {
+ LOGGER.info("Need to consolidate Vector sidecar file into columns.psf
for segment: {}, column: {}",
+ segmentName, column);
+ return true;
+ }
+ // Migration: columns.psf -> sidecar. Fires when the operator flipped
the flag from on to
+ // off for a segment whose vector payload was previously absorbed. We
detect this from the
+ // SegmentDirectory: the column has a vector index ({@code
existingColumns} contains it via
+ // {@code SingleFileIndexDirectory._columnEntries}) but no sidecar file
is on disk.
+ if (!desiredConfig.isStoreInSegmentFile()
+ && _segmentDirectory.getSegmentMetadata().getVersion() ==
SegmentVersion.v3
+ && existingBackend == null
Review Comment:
Fixed in 277ff1d02a (helpers in d8396fdcb5). `needUpdateIndices` and
`updateIndices` now share `getDriftedConsolidatedBackend`, which sniffs the
consolidated payload's leading magic (`LUCENE_V2` / `IVFF` / `IVPQ`, via the
new `VectorIndexUtils.detectConsolidatedVectorBackend`) when `existingBackend
== null` on a V3 segment. On a mismatch the column is rebuilt (`removeIndex` +
re-add) instead of running the absorb/extract migrations — so a consolidated
HNSW payload with the config flipped to IVF_PQ can no longer be extracted under
`.vector.ivfpq.index` and parsed by the wrong reader.
One subtlety the fix had to handle: `IVF_ON_DISK` shares the IVF_FLAT file
format, so its payloads sniff as `IVF_FLAT`. All four drift comparisons
(sidecar + consolidated, check + act) therefore compare via `storageFormatOf`
(IVF_ON_DISK → IVF_FLAT) — a raw enum comparison would have flagged a healthy
IVF_ON_DISK column as drifted and destroyed/rebuilt it on every segment load,
forever. This also fixes the same latent loop in the pre-existing sidecar drift
check. Unknown-magic payloads deliberately do NOT count as drift (extract
preserves the bytes; rebuilding would destroy an
unrecognized-but-possibly-valid payload).
Regression tests:
`testNeedUpdateIndicesReturnsTrueWhenConsolidatedBackendDiffers`,
`testUpdateIndicesRebuildsInsteadOfExtractingWhenConsolidatedBackendDiffers`
(asserts no `.vector.ivfpq.index` is written and the column reaches the rebuild
loop), `testUpdateIndicesExtractsWhenConsolidatedBackendMatches`, plus
IVF_ON_DISK no-drift tests for both consolidated and sidecar layouts, and
magic-sniff unit tests.
##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/vector/VectorIndexType.java:
##########
@@ -210,25 +211,85 @@ public VectorIndexReader
createIndexReader(SegmentDirectory.Reader segmentReader
return null;
}
VectorBackendType backendType = indexConfig.resolveBackendType();
- File configuredIndexFile =
- SegmentDirectoryPaths.findVectorIndexIndexFile(segmentDir,
metadata.getColumnName(), indexConfig);
- if (configuredIndexFile == null || !configuredIndexFile.exists()) {
- LOGGER.warn("Skipping vector index reader for column: {} because
configured backend {} does not have a "
- + "matching on-disk artifact in segment: {}",
- metadata.getColumnName(), backendType, segmentDir);
- return null;
+ String column = metadata.getColumnName();
+
+ if (backendType == VectorBackendType.HNSW) {
+ // Combined form: load the HNSW index from the typed entry inside
columns.psf when one
+ // actually exists. getConsolidatedVectorEntry (unlike hasIndexFor)
does not mistake the
+ // legacy Lucene directory for a packed entry, so a segment whose
handler has not yet
+ // migrated falls through to the legacy path below instead of failing
the whole load. The
+ // buffer is owned by the segment directory — this reader must not
close it.
+ if (indexConfig.isStoreInSegmentFile()) {
+ PinotDataBuffer buffer;
+ try {
+ buffer =
VectorIndexUtils.getConsolidatedVectorEntry(segmentReader, column);
Review Comment:
Fixed in 4cef636d3a (utility hardening in d8396fdcb5), two layers:
1. **Reader factory — legacy-directory-first**: with
`storeInSegmentFile=true`, the HNSW branch now checks the on-disk artifact
before probing columns.psf. If `findVectorIndexIndexFile(HNSW)` resolves a
Lucene *directory* (unmigrated V3, or any V1/V2 `FilePerIndexDirectory`
segment), it reads the directory directly and never calls `getIndexFor` — so
the `mapForReads` "must be a regular file" rejection can no longer kill the
load. Tests cover both the V1 root layout and the `v3/` subdirectory layout,
asserting via `verify(never())` that the probe does not run.
2. **Utility hardening**: `getConsolidatedVectorEntry` itself now maps
`FilePerIndexDirectory`'s directory-resolution `IllegalArgumentException`
(matched via a shared message-suffix constant) to `null` — a directory can
never be a packed typed entry — so any future ungated caller is protected too,
and its contract Javadoc now states the per-store behavior honestly (on V1/V2,
an IVF sidecar *file* is returned as-is; callers needing "truly consolidated"
must gate on version or on-disk artifacts).
Note: `TextIndexType.ReaderFactory` (which this vector machinery was modeled
on) has the identical unguarded probe on master for legacy `.lucene.index`
directories — filing that as a separate follow-up rather than expanding this PR.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]