Re: [PR] Consolidate IVF and HNSW vector indexes into segment columns.psf (opt-in via storeInSegmentFile) [pinot]

via GitHub Sat, 27 Jun 2026 05:52:35 -0700


Copilot commented on code in PR #18852:
URL: https://github.com/apache/pinot/pull/18852#discussion_r3486025461



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/store/SingleFileIndexDirectory.java:
##########
@@ -525,12 +531,14 @@ public Set<String> getColumnsWithIndex(IndexType<?, ?, ?> type) {
       }
     }
     if (type == StandardIndexes.vector()) {
+      // Vector may live as a combined file (legacy / storeInSegmentFile=false) or as a typed
+      // entry in columns.psf (storeInSegmentFile=true). Collect both. Removed the early-return
+      // that previously hid consolidated entries from this view.
       for (String column : _segmentMetadata.getAllColumns()) {
         if (VectorIndexUtils.hasVectorIndex(_segmentDirectory, column)) {
           columns.add(column);
         }
       }

Review Comment:
   `getColumnsWithIndex(StandardIndexes.vector())` only adds columns when 
`hasVectorIndex(...)` is true, so columns that only have the transient 
`*.combined.index` form will be omitted. That can break handler/loader logic 
that relies on this set during migrations or crash recovery.



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/store/SingleFileIndexDirectory.java:
##########
@@ -160,8 +160,11 @@ public boolean hasIndexFor(String column, IndexType<?, ?, ?> type) {
     if (type == StandardIndexes.text() && TextIndexUtils.hasTextIndex(_segmentDirectory, column)) {
       return true;
     }
-    if (type == StandardIndexes.vector()) {
-      return VectorIndexUtils.hasVectorIndex(_segmentDirectory, column);
+    // Vector index may live either as a combined file (legacy / storeInSegmentFile=false) or as
+    // a typed entry inside columns.psf (storeInSegmentFile=true). Check both — mirror the text
+    // pattern of "combined OR _columnEntries".
+    if (type == StandardIndexes.vector() && VectorIndexUtils.hasVectorIndex(_segmentDirectory, column)) {
+      return true;
     }

Review Comment:
   `hasIndexFor(..., StandardIndexes.vector())` only checks
   `VectorIndexUtils.hasVectorIndex`, which deliberately excludes the new
   transient `*.combined.index` form. If a segment directory temporarily
   contains only the combined-form vector index file (e.g. after a crash or
   rollback before the absorb step), `hasIndexFor` returns false and the vector
   reader is never constructed, even though
   `SegmentDirectoryPaths.findVectorIndexIndexFile(...)` can now resolve the
   combined file.
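
   A minimal sketch of a presence check that covers both on-disk forms. The
   class, the method names, and the legacy extension string are hypothetical
   and exist only for illustration; the two combined-form IVF extensions are
   the ones named in the Javadoc of this PR.

```java
import java.io.File;
import java.util.List;

final class VectorIndexPresence {
  // Hypothetical legacy sidecar extension, used only for this illustration.
  static final List<String> LEGACY_EXTENSIONS = List.of(".vector.legacy.index");
  // Combined-form IVF extensions as named in the PR's Javadoc.
  static final List<String> COMBINED_EXTENSIONS =
      List.of(".vector.ivfflat.combined.index", ".vector.ivfpq.combined.index");

  private VectorIndexPresence() {
  }

  // True when a file named <column><extension> exists for any listed extension.
  static boolean hasAny(File segDir, String column, List<String> extensions) {
    for (String extension : extensions) {
      if (new File(segDir, column + extension).exists()) {
        return true;
      }
    }
    return false;
  }

  // Reports the column as indexed when either the legacy or the transient combined form is present.
  static boolean hasVectorIndexInAnyForm(File segDir, String column) {
    return hasAny(segDir, column, LEGACY_EXTENSIONS) || hasAny(segDir, column, COMBINED_EXTENSIONS);
  }
}
```

   With a check shaped like this, a directory that holds only the transient
   combined file would still report the column as indexed.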



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/store/VectorIndexUtils.java:
##########
@@ -76,6 +95,39 @@ static boolean hasVectorIndex(File segDir, String column) {
         || new File(segDir, column + Indexes.VECTOR_IVF_PQ_INDEX_FILE_EXTENSION).exists();
   }
 
+  /// Returns {@code true} when the V1/V2 segment directory holds an IVF vector index in the
+  /// combined-form extension ({@code .vector.ivfflat.combined.index} or
+  /// {@code .vector.ivfpq.combined.index}). The combined form is written by an IVF creator run
+  /// with {@code storeInSegmentFile=true} and is meant to be packed into {@code columns.psf} by
+  /// the V2→V3 converter, not preserved as a sibling.
+  public static boolean hasCombinedFormVectorIndex(File segDir, String column) {
+    return new File(segDir, column + Indexes.VECTOR_IVF_FLAT_COMBINED_INDEX_FILE_EXTENSION).exists()
+        || new File(segDir, column + Indexes.VECTOR_IVF_PQ_COMBINED_INDEX_FILE_EXTENSION).exists()
+        || new File(segDir, column + Indexes.VECTOR_HNSW_COMBINED_INDEX_FILE_EXTENSION).exists();
+  }
+
+  /// Returns the {@code columns.psf} typed-entry buffer holding the column's consolidated vector
+  /// index, or {@code null} when no such entry has been packed into {@code columns.psf} yet.
+  ///
+  /// Unlike {@link SegmentDirectory.Reader#hasIndexFor}, this does NOT report a legacy on-disk
+  /// sidecar (an IVF flat file or an HNSW Lucene directory) as a match — only a real packed
+  /// `_columnEntries` slot counts. {@code SingleFileIndexDirectory} signals an absent typed slot by
+  /// throwing an unchecked exception from {@code getIndexFor}; that is mapped to {@code null} here,
+  /// while genuine I/O failures propagate. Callers use this to tell "a prior absorb already
+  /// committed bytes" (crash recovery) apart from "first absorb of an existing sidecar", and to
+  /// select the {@code columns.psf} read path only when the consolidated entry truly exists.
+  ///
+  /// The returned buffer is owned by the segment directory and must NOT be closed by the caller.
+  @Nullable
+  public static PinotDataBuffer getConsolidatedVectorEntry(SegmentDirectory.Reader reader, String column)
+      throws IOException {
+    try {
+      return reader.getIndexFor(column, StandardIndexes.vector());
+    } catch (RuntimeException e) {
+      return null;
+    }

Review Comment:
   `getConsolidatedVectorEntry` catches *all* `RuntimeException` from 
`reader.getIndexFor(...)` and converts it to `null`. In 
`SingleFileIndexDirectory`, `getIndexFor` throws `RuntimeException` not only 
for "missing typed entry" but also for corruption scenarios (e.g. missing magic 
marker). Swallowing those will make real corruption look like "absent entry" 
and can cause migration/crash-recovery to proceed on a broken segment without 
surfacing the underlying problem.
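
   One possible shape for the narrower handling, as a sketch under stated
   assumptions: `MissingEntryException`, the `Reader` interface, and
   `getEntryOrNull` are hypothetical stand-ins, not Pinot APIs. The idea is
   that only a dedicated "absent entry" signal maps to `null`, while every
   other unchecked exception (such as one raised for a bad magic marker)
   propagates.

```java
final class EntryLookup {
  private EntryLookup() {
  }

  /** Hypothetical marker exception: the typed slot is simply absent. */
  static final class MissingEntryException extends RuntimeException {
    MissingEntryException(String message) {
      super(message);
    }
  }

  /** Hypothetical stand-in for a reader whose lookup throws on absence and on corruption. */
  interface Reader {
    byte[] getIndexFor(String column);
  }

  /** Returns null only for a missing entry; any other RuntimeException propagates to the caller. */
  static byte[] getEntryOrNull(Reader reader, String column) {
    try {
      return reader.getIndexFor(column);
    } catch (MissingEntryException e) {
      // Absence is an expected state during migration, so it maps to null.
      return null;
    }
  }
}
```

   An explicit existence check before the read would be another way to get
   the same separation without relying on exception types.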



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/vector/VectorIndexType.java:
##########
@@ -210,25 +211,81 @@ public VectorIndexReader createIndexReader(SegmentDirectory.Reader segmentReader
         return null;
       }
       VectorBackendType backendType = indexConfig.resolveBackendType();
-      File configuredIndexFile =
-          SegmentDirectoryPaths.findVectorIndexIndexFile(segmentDir, metadata.getColumnName(), indexConfig);
-      if (configuredIndexFile == null || !configuredIndexFile.exists()) {
-        LOGGER.warn("Skipping vector index reader for column: {} because configured backend {} does not have a "
-                + "matching on-disk artifact in segment: {}",
-            metadata.getColumnName(), backendType, segmentDir);
-        return null;
+      String column = metadata.getColumnName();
+
+      if (backendType == VectorBackendType.HNSW) {
+        // Combined form: load the HNSW index from the typed entry inside columns.psf when one
+        // actually exists. getConsolidatedVectorEntry (unlike hasIndexFor) does not mistake the
+        // legacy Lucene directory for a packed entry, so a segment whose handler has not yet
+        // migrated falls through to the legacy path below instead of failing the whole load. The
+        // buffer is owned by the segment directory — this reader must not close it.
+        if (indexConfig.isStoreInSegmentFile()) {
+          PinotDataBuffer buffer;
+          try {
+            buffer = VectorIndexUtils.getConsolidatedVectorEntry(segmentReader, column);
+          } catch (IOException e) {
+            throw new RuntimeException(
+                "Failed to read consolidated HNSW vector index from columns.psf for column: " + column, e);
+          }
+          if (buffer != null) {
+            return new HnswVectorIndexReader(column, buffer, metadata.getTotalDocs(), indexConfig);
+          }
+          LOGGER.warn("storeInSegmentFile=true but no consolidated HNSW entry found in columns.psf for column: {} "
+              + "in segment: {}; falling back to the on-disk Lucene directory", column, segmentDir);
+        }
+        // Legacy path: load the HNSW index from the Lucene directory on disk.
+        return new HnswVectorIndexReader(column, segmentDir, metadata.getTotalDocs(), indexConfig);
+      }
+
+      // IVF backends accept a PinotDataBuffer; that buffer either comes from the consolidated
+      // typed entry inside columns.psf (when storeInSegmentFile=true) or from the legacy combined
+      // file. The chosen reader takes ownership of the buffer and is responsible for closing it
+      // (including the constructor's own failure path).
+      PinotDataBuffer buffer;
+      if (indexConfig.isStoreInSegmentFile()) {
+        try {
+          buffer = VectorIndexUtils.getConsolidatedVectorEntry(segmentReader, column);
+        } catch (IOException e) {
+          throw new RuntimeException(
+              "Failed to read consolidated vector index from columns.psf for column: " + column, e);
+        }
+        if (buffer == null) {
+          LOGGER.warn("Skipping vector index reader for column: {} because storeInSegmentFile=true "
+              + "but no consolidated entry was found in columns.psf in segment: {}", column, segmentDir);
+          return null;
+        }
+      } else {
+        File configuredIndexFile = SegmentDirectoryPaths.findVectorIndexIndexFile(segmentDir, column, indexConfig);
+        if (configuredIndexFile == null || !configuredIndexFile.exists()) {
+          LOGGER.warn("Skipping vector index reader for column: {} because configured backend {} does not have a "
+              + "matching on-disk artifact in segment: {}", column, backendType, segmentDir);
+          return null;
+        }
+        buffer = IvfCombinedBuffers.mapCombinedFile(configuredIndexFile, column,
+            "vector-" + backendType.name().toLowerCase());
       }
 
+      // Buffer ownership: when reading from columns.psf, the segment directory owns the buffer
+      // and the reader must not close it. Combined mmap buffers are owned by the reader.
+      boolean ownsBuffer = !indexConfig.isStoreInSegmentFile();

Review Comment:
   Once IVF falls back to an on-disk sidecar when the consolidated entry is
   missing, buffer ownership should be derived from where the buffer actually
   came from (columns.psf vs. mmap sidecar), not from the config flag alone.
   Otherwise a sidecar-mapped buffer used as a fallback would be treated as
   "borrowed" and leaked.
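
   A rough sketch of tracking ownership alongside the buffer instead of
   recomputing it from configuration. `BufferHandle` and its `Source` enum are
   hypothetical names, and `AutoCloseable` stands in for the real buffer type.

```java
final class BufferHandle implements AutoCloseable {
  /** Where the buffer came from; this alone decides who closes it. */
  enum Source { SEGMENT_FILE, SIDECAR }

  private final AutoCloseable _buffer;
  private final Source _source;

  BufferHandle(AutoCloseable buffer, Source source) {
    _buffer = buffer;
    _source = source;
  }

  /** A sidecar mapping belongs to the reader; a columns.psf slice belongs to the segment directory. */
  boolean ownsBuffer() {
    return _source == Source.SIDECAR;
  }

  @Override
  public void close() throws Exception {
    // Close only buffers this handle owns; borrowed buffers are left for their owner.
    if (ownsBuffer()) {
      _buffer.close();
    }
  }
}
```

   Because the source is recorded at the point where the buffer is obtained, a
   fallback to the sidecar path cannot be mislabeled as borrowed.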



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/vector/VectorIndexType.java:
##########
@@ -210,25 +211,81 @@ public VectorIndexReader createIndexReader(SegmentDirectory.Reader segmentReader
         return null;
       }
       VectorBackendType backendType = indexConfig.resolveBackendType();
-      File configuredIndexFile =
-          SegmentDirectoryPaths.findVectorIndexIndexFile(segmentDir, metadata.getColumnName(), indexConfig);
-      if (configuredIndexFile == null || !configuredIndexFile.exists()) {
-        LOGGER.warn("Skipping vector index reader for column: {} because configured backend {} does not have a "
-                + "matching on-disk artifact in segment: {}",
-            metadata.getColumnName(), backendType, segmentDir);
-        return null;
+      String column = metadata.getColumnName();
+
+      if (backendType == VectorBackendType.HNSW) {
+        // Combined form: load the HNSW index from the typed entry inside columns.psf when one
+        // actually exists. getConsolidatedVectorEntry (unlike hasIndexFor) does not mistake the
+        // legacy Lucene directory for a packed entry, so a segment whose handler has not yet
+        // migrated falls through to the legacy path below instead of failing the whole load. The
+        // buffer is owned by the segment directory — this reader must not close it.
+        if (indexConfig.isStoreInSegmentFile()) {
+          PinotDataBuffer buffer;
+          try {
+            buffer = VectorIndexUtils.getConsolidatedVectorEntry(segmentReader, column);
+          } catch (IOException e) {
+            throw new RuntimeException(
+                "Failed to read consolidated HNSW vector index from columns.psf for column: " + column, e);
+          }
+          if (buffer != null) {
+            return new HnswVectorIndexReader(column, buffer, metadata.getTotalDocs(), indexConfig);
+          }
+          LOGGER.warn("storeInSegmentFile=true but no consolidated HNSW entry found in columns.psf for column: {} "
+              + "in segment: {}; falling back to the on-disk Lucene directory", column, segmentDir);
+        }
+        // Legacy path: load the HNSW index from the Lucene directory on disk.
+        return new HnswVectorIndexReader(column, segmentDir, metadata.getTotalDocs(), indexConfig);
+      }
+
+      // IVF backends accept a PinotDataBuffer; that buffer either comes from the consolidated
+      // typed entry inside columns.psf (when storeInSegmentFile=true) or from the legacy combined
+      // file. The chosen reader takes ownership of the buffer and is responsible for closing it
+      // (including the constructor's own failure path).
+      PinotDataBuffer buffer;
+      if (indexConfig.isStoreInSegmentFile()) {
+        try {
+          buffer = VectorIndexUtils.getConsolidatedVectorEntry(segmentReader, column);
+        } catch (IOException e) {
+          throw new RuntimeException(
+              "Failed to read consolidated vector index from columns.psf for column: " + column, e);
+        }
+        if (buffer == null) {
+          LOGGER.warn("Skipping vector index reader for column: {} because storeInSegmentFile=true "
+              + "but no consolidated entry was found in columns.psf in segment: {}", column, segmentDir);
+          return null;
+        }
+      } else {
+        File configuredIndexFile = SegmentDirectoryPaths.findVectorIndexIndexFile(segmentDir, column, indexConfig);
+        if (configuredIndexFile == null || !configuredIndexFile.exists()) {
+          LOGGER.warn("Skipping vector index reader for column: {} because configured backend {} does not have a "
+              + "matching on-disk artifact in segment: {}", column, backendType, segmentDir);
+          return null;
+        }
+        buffer = IvfCombinedBuffers.mapCombinedFile(configuredIndexFile, column,
+            "vector-" + backendType.name().toLowerCase());
       }

Review Comment:
   For IVF backends, when `storeInSegmentFile=true` but the consolidated typed 
entry is missing (e.g. before/after a failed absorb), the reader factory 
returns null and silently disables the vector index even if a usable on-disk 
artifact exists. HNSW already falls back to legacy in this situation; IVF 
should do the same so the system can still use the index (and avoid unexpected 
exact-scan fallback) until migration completes.
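
   The resolution order suggested here could look roughly like the following.
   This is a sketch only: `IvfSourceResolver`, its `Source` enum, and the two
   boolean probes are hypothetical and merely model "a consolidated entry
   exists" and "a usable on-disk artifact exists".

```java
import java.util.Optional;
import java.util.function.BooleanSupplier;

final class IvfSourceResolver {
  enum Source { CONSOLIDATED_ENTRY, ON_DISK_SIDECAR }

  private IvfSourceResolver() {
  }

  /**
   * Prefers the consolidated entry when the flag is set and the entry exists, then falls back to
   * the on-disk artifact. Returns empty only when neither source is available.
   */
  static Optional<Source> resolve(boolean storeInSegmentFile, BooleanSupplier hasConsolidatedEntry,
      BooleanSupplier hasSidecarFile) {
    if (storeInSegmentFile && hasConsolidatedEntry.getAsBoolean()) {
      return Optional.of(Source.CONSOLIDATED_ENTRY);
    }
    if (hasSidecarFile.getAsBoolean()) {
      return Optional.of(Source.ON_DISK_SIDECAR);
    }
    return Optional.empty();
  }
}
```

   In this shape the index is skipped only when neither form is present, which
   matches the fallback behavior the HNSW branch already has.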



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/vector/HnswVectorIndexReader.java:
##########
@@ -86,6 +91,36 @@ public HnswVectorIndexReader(String column, File indexDir, int numDocs, VectorIn
     }
   }
 
+  /**
+   * Buffer-backed constructor: reads the HNSW index from a combined {@link PinotDataBuffer}
+   * (the {@code LUCENE_V2} packed form produced by {@code HnswVectorIndexCombined}).
+   *
+   * <p>The buffer is <em>not</em> owned by this reader — closing this reader does not close the
+   * buffer. The buffer's lifetime must exceed this reader's lifetime; the segment directory is
+   * responsible for closing it.</p>
+   *
+   * @param column      column name
+   * @param indexBuffer combined buffer in LUCENE_V2 format; not owned by this reader
+   * @param numDocs     number of documents in the segment
+   * @param config      vector index configuration
+   */
+  public HnswVectorIndexReader(String column, PinotDataBuffer indexBuffer, int numDocs, VectorIndexConfig config) {
+    _column = column;
+    try {
+      _indexDirectory = HnswVectorIndexBufferReader.createLuceneDirectory(indexBuffer, column);
+      _indexReader = DirectoryReader.open(_indexDirectory);
+      _indexSearcher = new IndexSearcher(_indexReader);
+
+      // Try to extract the mapping from the packed buffer first; build from the Lucene index if absent.
+      PinotDataBuffer mappingBuffer = HnswVectorIndexBufferReader.extractDocIdMappingBuffer(indexBuffer, column);
+      _docIdTranslator = new DocIdTranslator(mappingBuffer, numDocs, _indexSearcher);
+    } catch (Exception e) {
+      LOGGER.error("Failed to instantiate buffer-backed HNSW index reader for column {}, exception {}", column,
+          e.getMessage());
+      throw new RuntimeException(e);
+    }

Review Comment:
   The new buffer-backed constructor logs only `e.getMessage()` and drops the
   throwable, which loses stack traces in server logs and makes segment-load
   failures hard to debug. Log the exception itself by passing `e` as the final
   argument to `LOGGER.error(...)`, which SLF4J treats as the throwable.



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/vector/lucene99/HnswVectorIndexCombined.java:
##########
@@ -0,0 +1,347 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.segment.creator.impl.vector.lucene99;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.nio.ByteOrder;
+import java.nio.channels.FileChannel;
+import java.nio.file.StandardOpenOption;
+import java.util.Map;
+import java.util.TreeMap;
+import javax.annotation.Nullable;
+import org.apache.commons.io.FileUtils;
+import org.apache.pinot.segment.local.segment.creator.impl.text.LuceneCombinedTextIndexConstants;
+import org.apache.pinot.segment.spi.V1Constants;
+import org.apache.pinot.segment.spi.store.SegmentDirectoryPaths;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+
+/// Utility class to pack a Lucene HNSW index directory (and its optional docId mapping file) into
+/// a single combined file using the {@link LuceneCombinedTextIndexConstants#MAGIC_NUMBER LUCENE_V2}
+/// layout. Mirrors {@code LuceneTextIndexCombined} — reuses the same format constants to keep a
+/// single on-disk format shared across text and HNSW vector indexes.
+///
+/// Layout (identical to the text-index LUCENE_V2 format):
+/// ```
+/// [Header]
+///   Magic "LUCENE_V2"   9 bytes
+///   Version             4 bytes (little-endian int)
+///   Total buffer size   8 bytes (little-endian long)
+///   File count          4 bytes (little-endian int)
+///   Reserved            4 bytes
+///
+/// [File metadata, one entry per file]
+///   Name length         2 bytes (little-endian short)
+///   Name                variable
+///   File offset         8 bytes (little-endian long)
+///   File size           8 bytes (little-endian long)
+///
+/// [File data]
+///   Raw bytes of each file concatenated in metadata order
+/// ```
+public final class HnswVectorIndexCombined {
+  private static final Logger LOGGER = LoggerFactory.getLogger(HnswVectorIndexCombined.class);
+
+  private HnswVectorIndexCombined() {
+  }
+
+  /// Packs all files in {@code hnswIndexDir} (plus the optional docId mapping file) into a single
+  /// combined file at {@code outputFilePath}.
+  ///
+  /// @param hnswIndexDir     the Lucene HNSW index directory to pack
+  /// @param outputFilePath   destination path for the combined file
+  /// @param segmentIndexDir  when non-null, the segment's top-level index directory; used to
+  ///                         locate the docId mapping file for inclusion in the packed output
+  /// @param column           column name; used to locate the docId mapping file when
+  ///                         {@code segmentIndexDir} is provided
+  /// @throws IOException if any file operations fail
+  public static void combineHnswIndexFiles(File hnswIndexDir, String outputFilePath,
+      @Nullable File segmentIndexDir, @Nullable String column)
+      throws IOException {
+    if (!hnswIndexDir.exists() || !hnswIndexDir.isDirectory()) {
+      throw new IllegalArgumentException(
+          "HNSW index directory does not exist or is not a directory: " + hnswIndexDir);
+    }
+
+    LOGGER.info("Combining HNSW index files from directory: {}", hnswIndexDir.getAbsolutePath());
+
+    Map<String, FileInfo> fileInfoMap = collectFiles(hnswIndexDir, segmentIndexDir, column);
+    int fileCount = fileInfoMap.size();
+
+    if (fileCount == 0) {
+      throw new IOException("No files found in HNSW index directory: " + hnswIndexDir);
+    }
+
+    long totalSize = calculateTotalBufferSize(fileInfoMap);
+    if (totalSize > Integer.MAX_VALUE) {
+      throw new IOException("Combined HNSW index size too large: " + totalSize + " bytes");
+    }
+
+    File outputFile = new File(outputFilePath);
+    // TRUNCATE_EXISTING so a leftover (larger) file from a previously-crashed pack does not leave
+    // stale trailing bytes past the new payload — that would inflate the file length and break the
+    // size-based crash-recovery check in VectorIndexHandler that compares the combined file length
+    // to the columns.psf typed-entry size.
+    try (FileChannel outputChannel = FileChannel.open(outputFile.toPath(), StandardOpenOption.CREATE,
+        StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
+      writeHeader(outputChannel, fileCount, (int) totalSize);
+      long dataOffset = LuceneCombinedTextIndexConstants.getHeaderSize() + calculateMetadataSize(fileInfoMap);
+      writeFileMetadata(outputChannel, fileInfoMap, dataOffset);
+      writeFileData(outputChannel, fileInfoMap);
+    }
+
+    LOGGER.info("Combined {} HNSW index files into: {} ({} bytes)", fileCount, outputFilePath, totalSize);
+  }
+
+  /// Collects all regular files from {@code hnswIndexDir} and optionally the docId mapping file
+  /// from the segment's flat directory. Uses a {@link TreeMap} for deterministic ordering.
+  private static Map<String, FileInfo> collectFiles(File hnswIndexDir, @Nullable File segmentIndexDir,
+      @Nullable String column)
+      throws IOException {
+    Map<String, FileInfo> fileInfoMap = new TreeMap<>();
+
+    File[] files = hnswIndexDir.listFiles();
+    if (files != null) {
+      for (File file : files) {
+        if (file.isFile()) {
+          fileInfoMap.put(file.getName(), new FileInfo(file, file.getName(), file.length()));
+        }
+      }
+    }
+
+    // Include the docId mapping file when available. It lives beside the HNSW directory, not
+    // inside it (same convention as text-index).
+    if (segmentIndexDir != null && column != null) {
+      File segmentDir = SegmentDirectoryPaths.findSegmentDirectory(segmentIndexDir);
+      File mappingFile = new File(segmentDir,
+          column + V1Constants.Indexes.VECTOR_HNSW_INDEX_DOCID_MAPPING_FILE_EXTENSION);
+      if (mappingFile.exists() && mappingFile.isFile()) {
+        fileInfoMap.put(mappingFile.getName(), new FileInfo(mappingFile, mappingFile.getName(), mappingFile.length()));
+        LOGGER.info("Including docId mapping file: {} ({} bytes)", mappingFile.getName(), mappingFile.length());
+      }
+    }
+
+    return fileInfoMap;
+  }
+
+  /// Extracts files packed inside a combined HNSW file back into a Lucene directory.
+  ///
+  /// This is the inverse of {@link #combineHnswIndexFiles}. The docId mapping file (if present)
+  /// is extracted into {@code targetDir} alongside the Lucene index files; the caller is
+  /// responsible for moving it to its canonical location if needed.
+  ///
+  /// @param combinedFile source combined file (LUCENE_V2 layout)
+  /// @param targetDir    destination directory; created if absent
+  /// @throws IOException if any file operations fail
+  public static void extractHnswIndexFiles(File combinedFile, File targetDir)
+      throws IOException {
+    if (!combinedFile.exists() || !combinedFile.isFile()) {
+      throw new IllegalArgumentException("Combined file does not exist or is not a file: " + combinedFile);
+    }
+    FileUtils.forceMkdir(targetDir);
+
+    try (FileChannel inputChannel = FileChannel.open(combinedFile.toPath(), StandardOpenOption.READ)) {
+      // Parse header
+      byte[] magicBytes = new byte[LuceneCombinedTextIndexConstants.MAGIC_NUMBER_LENGTH];
+      readFully(inputChannel, ByteBuffer.wrap(magicBytes));
+      String magic = new String(magicBytes);
+      if (!LuceneCombinedTextIndexConstants.MAGIC_NUMBER.equals(magic)) {
+        throw new IOException("Invalid magic number in combined HNSW file: " + magic);
+      }
+
+      ByteBuffer intBuf = ByteBuffer.allocate(Integer.BYTES).order(ByteOrder.LITTLE_ENDIAN);
+      readFully(inputChannel, intBuf);
+      intBuf.flip();
+      int version = intBuf.getInt();
+      if (version != LuceneCombinedTextIndexConstants.VERSION) {
+        throw new IOException("Unsupported version in combined HNSW file: " + version);
+      }
+
+      // Skip total size (8 bytes) and file count (4 bytes) header fields
+      ByteBuffer longBuf = ByteBuffer.allocate(Long.BYTES).order(ByteOrder.LITTLE_ENDIAN);
+      readFully(inputChannel, longBuf); // totalSize
+      longBuf.flip();
+      // (unused, but we advance the channel position past it)
+
+      intBuf = ByteBuffer.allocate(Integer.BYTES).order(ByteOrder.LITTLE_ENDIAN);
+      readFully(inputChannel, intBuf);
+      intBuf.flip();
+      int fileCount = intBuf.getInt();
+
+      // Skip reserved field
+      intBuf = ByteBuffer.allocate(Integer.BYTES).order(ByteOrder.LITTLE_ENDIAN);
+      readFully(inputChannel, intBuf);
+
+      // Parse file metadata
+      String[] fileNames = new String[fileCount];
+      long[] fileOffsets = new long[fileCount];
+      long[] fileSizes = new long[fileCount];
+      for (int i = 0; i < fileCount; i++) {
+        ByteBuffer shortBuf = ByteBuffer.allocate(Short.BYTES).order(ByteOrder.LITTLE_ENDIAN);
+        readFully(inputChannel, shortBuf);
+        shortBuf.flip();
+        short nameLength = shortBuf.getShort();
+
+        byte[] nameBytes = new byte[nameLength];
+        readFully(inputChannel, ByteBuffer.wrap(nameBytes));
+        fileNames[i] = new String(nameBytes);
+
+        longBuf = ByteBuffer.allocate(Long.BYTES).order(ByteOrder.LITTLE_ENDIAN);
+        readFully(inputChannel, longBuf);
+        longBuf.flip();
+        fileOffsets[i] = longBuf.getLong();
+
+        longBuf = ByteBuffer.allocate(Long.BYTES).order(ByteOrder.LITTLE_ENDIAN);
+        readFully(inputChannel, longBuf);
+        longBuf.flip();
+        fileSizes[i] = longBuf.getLong();
+      }
+
+      // Extract each file by seeking to its offset and copying bytes
+      for (int i = 0; i < fileCount; i++) {
+        File outFile = new File(targetDir, fileNames[i]);
+        long fileSize = fileSizes[i];
+        try (FileChannel outChannel = FileChannel.open(outFile.toPath(), StandardOpenOption.CREATE,
+            StandardOpenOption.WRITE)) {
+          long remaining = fileSize;
+          long srcOffset = fileOffsets[i];
+          while (remaining > 0) {
+            long transferred = inputChannel.transferTo(srcOffset, remaining, outChannel);
+            srcOffset += transferred;
+            remaining -= transferred;
+          }

Review Comment:
   `extractHnswIndexFiles` loops on `FileChannel.transferTo` without a progress
   check. `transferTo` returns 0 when the requested position is at or past the
   end of the source file, so if the combined file is corrupt (bad offsets or
   sizes) `remaining` never decreases and the loop can spin forever during
   segment load/migration.
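
   A sketch of a copy loop with both a bounds check and a progress check.
   `BoundedTransfer.copyRange` is a hypothetical helper, not part of the PR.
   Treating a single zero-byte transfer as a failure is a simplification that
   suits file-to-file copies.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;

final class BoundedTransfer {
  private BoundedTransfer() {
  }

  /** Copies exactly {@code size} bytes starting at {@code offset}, failing fast instead of spinning. */
  static void copyRange(FileChannel src, long offset, long size, WritableByteChannel dst)
      throws IOException {
    // Reject ranges that fall outside the source before transferring anything.
    if (offset < 0 || size < 0 || offset > src.size() - size) {
      throw new IOException("Range at offset " + offset + " with size " + size
          + " exceeds source size " + src.size());
    }
    long position = offset;
    long remaining = size;
    while (remaining > 0) {
      long transferred = src.transferTo(position, remaining, dst);
      if (transferred <= 0) {
        // No progress: surface the problem rather than looping forever.
        throw new IOException("No progress at position " + position + " with " + remaining + " bytes left");
      }
      position += transferred;
      remaining -= transferred;
    }
  }
}
```

   Validating offsets against the source size up front turns a corrupt
   metadata entry into an immediate `IOException` with a useful message.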



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

