Hi all,

I'm experimenting with parallel HNSW graph building, but I'm seeing very long execution times in the `connectComponents` method when using random vectors. With the SIFT1M dataset, this does not occur.
I'm adding 100k documents with float vectors of dimensionality 128; for the SIFT run I use the first 100k SIFT vectors. Merging uses 4 threads. The execution time for random vectors is around 239 seconds, while for the SIFT vectors it is around 4 seconds. That's a huge difference. My CPU is a Core Ultra 9 185H.

Below is a reproducer for random vectors. The code adds 100k vectors, committing after every 10k vectors. This creates 10 segments, which are merged into 1 segment when forceMerge is called, with 4 merge workers. I'm using Lucene 10.0 on Java 21:

import static java.util.concurrent.TimeUnit.NANOSECONDS;
import static org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat.DEFAULT_BEAM_WIDTH;
import static org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat.DEFAULT_MAX_CONN;

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Random;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;
import org.apache.lucene.util.InfoStream;
import org.apache.lucene.util.PrintStreamInfoStream;

public class ConcurrentMergeTest {

  private static final Random RANDOM = new Random(42);

  public static void main(String[] args) throws Exception {
    Path path = Files.createTempDirectory("conc-merge-test");
    Directory directory = new MMapDirectory(path);
    InfoStream.setDefault(new PrintStreamInfoStream(System.out));
    IndexWriterConfig config = new IndexWriterConfig();
    try (ExecutorService executor = Executors.newFixedThreadPool(4)) {
      // Wrap the default codec so HNSW merges run on 4 workers backed by the executor.
      config.setCodec(new FilterCodec(Codec.getDefault().getName(), Codec.getDefault()) {
        @Override
        public KnnVectorsFormat knnVectorsFormat() {
          return new Lucene99HnswVectorsFormat(DEFAULT_MAX_CONN, DEFAULT_BEAM_WIDTH, 4, executor);
        }
      });
      long start = System.nanoTime();
      try (IndexWriter writer = new IndexWriter(directory, config)) {
        int batchSize = 10_000;
        int numVectors = 100_000;
        for (int i = 0; i < numVectors; i++) {
          float[] vec = randomVector(128);
          Document doc = new Document();
          doc.add(new KnnFloatVectorField("vector", vec, VectorSimilarityFunction.DOT_PRODUCT));
          writer.addDocument(doc);
          if ((i + 1) % batchSize == 0) {
            // Commit after each batch so each batch becomes its own segment.
            writer.commit();
          }
        }
        System.out.println("Merging all segments...");
        writer.forceMerge(1);
        writer.commit();
        System.out.println("Merge completed!");
      }
      System.out.println("Elapsed time: " + NANOSECONDS.toMillis(System.nanoTime() - start) + " ms");
    }

    // Verify the index.
    try (DirectoryReader reader = DirectoryReader.open(directory)) {
      System.out.println("Number of documents in the index: " + reader.numDocs());
    }
  }

  private static float[] randomVector(int dim) {
    float[] v = new float[dim];
    for (int i = 0; i < dim; i++) {
      v[i] = RANDOM.nextInt(219);
    }
    return v;
  }
}

The debug output includes the following lines; I'm not sure whether that's a problem or not, but no exception is reported to my code:

HNSW 0 [2024-11-27T13:20:12.852323560Z; Lucene Merge Thread #0]: connectComponents failed on level 1
HNSW 0 [2024-11-27T13:20:19.833511543Z; Lucene Merge Thread #0]: connectComponents failed on level 2
HNSW 0 [2024-11-27T13:20:19.833621748Z; Lucene Merge Thread #0]: connectComponents 230758 ms

I'm not attaching the code that loads the SIFT dataset, as it is quite a lot longer (it has to parse the binary files); a sketch of what it amounts to follows below.
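For reference, here is a minimal sketch of what the SIFT loading amounts to. It assumes the standard TEXMEX .fvecs layout (each record is a little-endian int32 dimension followed by that many little-endian float32 values); the helper name and shape are illustrative, not the exact code I ran:

// Illustrative helper (not the exact loader I used): reads the first `count`
// vectors from an .fvecs file. Needs java.io.BufferedInputStream,
// java.io.DataInputStream, java.io.IOException, java.nio.ByteBuffer,
// and java.nio.ByteOrder in addition to the imports above.
private static float[][] readFvecs(Path file, int count) throws IOException {
  float[][] vectors = new float[count][];
  try (DataInputStream in =
      new DataInputStream(new BufferedInputStream(Files.newInputStream(file)))) {
    byte[] header = new byte[Integer.BYTES];
    for (int i = 0; i < count; i++) {
      in.readFully(header);
      int dim = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN).getInt();
      byte[] raw = new byte[dim * Float.BYTES];
      in.readFully(raw);
      float[] v = new float[dim];
      ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().get(v);
      vectors[i] = v;
    }
  }
  return vectors;
}

Indexing the SIFT vectors then just replaces the randomVector(128) call in the reproducer with vectors[i].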
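One more note in case it matters: the javadoc for VectorSimilarityFunction.DOT_PRODUCT says all vectors must be of unit length to use it, and neither my random vectors (components drawn from 0..218) nor the raw SIFT vectors are normalized. I don't know whether that is related to the connectComponents slowdown, but for completeness, a unit-normalized variant of randomVector would look like this (sketch only, not what the reproducer above runs):

// Sketch: same random components as randomVector, scaled to unit length
// as DOT_PRODUCT expects. Not what the reproducer above actually runs.
private static float[] randomUnitVector(int dim) {
  float[] v = new float[dim];
  double norm = 0;
  for (int i = 0; i < dim; i++) {
    v[i] = RANDOM.nextInt(219);
    norm += (double) v[i] * v[i];
  }
  // An all-zero vector is possible in theory but astronomically unlikely here.
  float inv = (float) (1.0 / Math.sqrt(norm));
  for (int i = 0; i < dim; i++) {
    v[i] *= inv;
  }
  return v;
}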
Does anybody have any ideas? I'm attaching the whole debug output.

Thanks,
Viliam

<<attachment: debug_output.zip>>