jmazanec15 commented on issue #12342: URL: https://github.com/apache/lucene/issues/12342#issuecomment-1644167264
I ran an initial experiment. Recall without the pre-processing appears to be much higher (**99.1**) than with the pre-processing (**87.4**) when mimicking one of the experiments from https://blog.vespa.ai/announcing-maximum-inner-product-search/. That being said, @benwtrent would you be able to double-check my experiment setup to ensure I didn't overlook something?

## Experiment

Their experiment used the following data:
* data set from https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings (485,859 docs)
* 400k 768-dimensional vectors (first 400k of the data set)
* 10k queries (last 10k of the data set)

And used the following config:
* exploreAdditionalHits = 190
* k = 10
* max-links-per-node = 48
* neighbors-to-explore-at-insert = 200
* insert order random

For this, they reported a recall@10 of **87.4**.

I used luceneutil and set the following parameters:
* single segment
* maxConn: 48
* beamWidthIndex: 200
* fanout: 200
* topK: 10
* metric: 'angular'

I got a recall@10 of **99.1**:
```
$ time python src/python/knnPerfTest.py
WARNING: Gnuplot module not present; will not make charts
lucene {'ndoc': (400000,), 'maxConn': (48,), 'beamWidthIndex': (200,), 'fanout': (200,), 'topK': (10,)}
/home/ec2-user/candidate/lucene/lucene/core/build/libs/lucene-core-10.0.0-SNAPSHOT.jar:/home/ec2-user/candidate/lucene/lucene/sandbox/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/misc/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/facet/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/analysis/common/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/analysis/icu/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/queryparser/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/grouping/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/suggest/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/highlighter/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/codecs/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/queries/build/classes/java/main:/home/ec2-user/.gradle/caches/modules-2/files-2.1/com.carrotsearch/hppc/0.9.1/4bf4c51e06aec600894d841c4c004566b20dd357/hppc-0.9.1.jar:/home/ec2-user/candidate/luceneutil/lib/HdrHistogram.jar:/home/ec2-user/candidate/luceneutil/build:/home/ec2-user/candidate/luceneutil/src/main
recall latency nDoc fanout maxConn beamWidth visited index ms
['java', '-cp', '/home/ec2-user/candidate/lucene/lucene/core/build/libs/lucene-core-10.0.0-SNAPSHOT.jar:/home/ec2-user/candidate/lucene/lucene/sandbox/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/misc/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/facet/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/analysis/common/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/analysis/icu/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/queryparser/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/grouping/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/suggest/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/highlighter/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/codecs/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/queries/build/classes/java/main:/home/ec2-user/.gradle/caches/modules-2/files-2.1/com.carrotsearch/hppc/0.9.1/4bf4c51e06aec600894d841c4c004566b20dd357/hppc-0.9.1.jar:/home/ec2-user/candidate/luceneutil/lib/HdrHistogram.jar:/home/ec2-user/candidate/luceneutil/build:/home/ec2-user/candidate/luceneutil/src/main', '--add-modules', 'jdk.incubator.vector', '-Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false', 'KnnGraphTester', '-ndoc', '400000', '-maxConn', '48', '-beamWidthIndex', '200', '-fanout', '200', '-topK', '10', '-dim', '768', '-docs', '/home/ec2-user/data-prep/wiki768.train', '-reindex', '-search', '/home/ec2-user/data-prep/wiki768.test', '-metric', 'angular', '-quiet']
WARNING: Using incubator modules: jdk.incubator.vector
0.991 6.98 400000 200 48 200 210 1913700 1.00 post-filter

real 45m3.266s
user 40m8.451s
sys 4m54.290s
```

<details>
<summary> <h2>Dataset setup details</h2></summary>

I pulled the data sets as parquet files from https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings/tree/main/data:
```
curl -LO https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings/resolve/main/data/train-00000-of-00004-1a1932c9ca1c7152.parquet
curl -LO https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings/resolve/main/data/train-00001-of-00004-f4a4f5540ade14b4.parquet
curl -LO https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings/resolve/main/data/train-00002-of-00004-ff770df3ab420d14.parquet
curl -LO https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings/resolve/main/data/train-00003-of-00004-85b3dbbc960e92ec.parquet
```

I ran the following to translate it into a data set that luceneutil can use (pip install numpy pyarrow):
```
import numpy as np
import pyarrow.parquet as pq

tb1 = pq.read_table("train-00000-of-00004-1a1932c9ca1c7152.parquet", columns=['emb'])
tb2 = pq.read_table("train-00001-of-00004-f4a4f5540ade14b4.parquet", columns=['emb'])
tb3 = pq.read_table("train-00002-of-00004-ff770df3ab420d14.parquet", columns=['emb'])
tb4 = pq.read_table("train-00003-of-00004-85b3dbbc960e92ec.parquet", columns=['emb'])

np1 = tb1[0].to_numpy()
np2 = tb2[0].to_numpy()
np3 = tb3[0].to_numpy()
np4 = tb4[0].to_numpy()

np_total = np.concatenate((np1, np2, np3, np4))

# Have to convert to a list here to get
# the numpy ndarray's shape correct later
# There's probably a better way...
flat_ds = list()
for vec in np_total:
    flat_ds.append(vec)
np_flat_ds = np.array(flat_ds)

# Shape is (485859, 768) and dtype is float32
np_flat_ds

with open("wiki768.train", "w") as out_f:
    np_flat_ds[0:400000].tofile(out_f)

with open("wiki768.test", "w") as out_f:
    np_flat_ds[475858:-1].tofile(out_f)
```

I then modified knnPerfTest.py to use this data set, set the above parameters, and ran the test.
</details>
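A quick sanity check on the train/test slices used in the dataset setup above may help when reviewing the setup. This is a minimal sketch that was not part of the original run: `split_bounds` is a hypothetical helper, and the only inputs are the data set size (485,859) and the slice expressions quoted in the script. It works on index ranges only, so it needs neither numpy nor the data.

```python
N_DOCS = 485_859  # data set size reported in the comment above


def split_bounds(n_docs, sl):
    """Return (first_index, last_index, count) of the rows a slice selects."""
    idx = range(n_docs)[sl]  # slicing a range mirrors slicing the array's first axis
    return idx[0], idx[-1], len(idx)


# Slices as written in the conversion script
train = split_bounds(N_DOCS, slice(0, 400_000))             # np_flat_ds[0:400000]
test_as_written = split_bounds(N_DOCS, slice(475_858, -1))  # np_flat_ds[475858:-1]

# For comparison: the literal "last 10k of the data set"
test_last_10k = split_bounds(N_DOCS, slice(-10_000, None))  # np_flat_ds[-10000:]

print(train)            # (0, 399999, 400000)
print(test_as_written)  # (475858, 485857, 10000)
print(test_last_10k)    # (475859, 485858, 10000)
```

So `[475858:-1]` does yield 10,000 query vectors, but the window is shifted by one relative to the literal last 10k: it includes index 475858 and omits the final vector. Both windows are disjoint from the 400k indexed vectors, so the effect on recall should be negligible.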
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org