[GitHub] [lucene] jmazanec15 commented on issue #12342: Prevent VectorSimilarity.DOT_PRODUCT from returning negative scores

via GitHub Thu, 20 Jul 2023 08:47:59 -0700


jmazanec15 commented on issue #12342:
URL: https://github.com/apache/lucene/issues/12342#issuecomment-1644167264


   I ran an initial experiment. When mimicking one of the experiments from https://blog.vespa.ai/announcing-maximum-inner-product-search/, recall without the pre-processing (**99.1**) appears to be much higher than recall with the pre-processing (**87.4**).
   
   That being said, @benwtrent would you be able to double check my experiment 
setup to ensure I didn't overlook something?
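
For reference, a minimal sketch of what this kind of pre-processing can look like. It assumes the pre-processing is the usual reduction from maximum inner product search to nearest neighbor search, where each document vector gets one extra dimension so that all documents end up with the same norm. The function names and the random data are hypothetical, and this is not Vespa's or Lucene's implementation:

```python
import numpy as np

def augment_docs(docs: np.ndarray) -> np.ndarray:
    """Append sqrt(M^2 - ||x||^2) to each row, where M is the largest row norm."""
    norms_sq = np.sum(docs.astype(np.float64) ** 2, axis=1)
    max_norm_sq = norms_sq.max()
    # Clamp at zero to guard against tiny negative values from rounding
    extra = np.sqrt(np.maximum(max_norm_sq - norms_sq, 0.0))
    return np.hstack([docs, extra[:, None]])

def augment_query(query: np.ndarray) -> np.ndarray:
    """Append a zero so the query matches the augmented dimensionality."""
    return np.append(query, 0.0)

# Synthetic data with varying norms (illustration only)
rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 8)) * rng.uniform(0.5, 2.0, size=(1000, 1))
query = rng.normal(size=8)

aug_docs = augment_docs(docs)
aug_query = augment_query(query)

# Every augmented document has norm M, so the nearest neighbor by Euclidean
# distance in the augmented space is the maximum inner product document
# in the original space.
best_by_dot = int(np.argmax(docs @ query))
best_by_l2 = int(np.argmin(np.linalg.norm(aug_docs - aug_query, axis=1)))
```

The equivalence follows from `||x' - q'||^2 = M^2 + ||q||^2 - 2 * dot(x, q)`: the first two terms are the same for every document, so minimizing the distance maximizes the dot product.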
   
   ## Experiment
   
   Their experiment used the following data:
   * data set from 
https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings 
(485,859 docs)
   * 400k 768-dimensional vectors (first 400k of the data set)
   * 10k queries (last 10k of the data set)
   
   And used the following config:
   * exploreAdditionalHits = 190
   * k = 10
   * max-links-per-node = 48
   * neighbors-to-explore-at-insert = 200
   * insert order random
   
   For this, they reported a recall@10 of **87.4**
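
Recall@10 here is read as the fraction of the exact top-10 neighbors that the approximate search returns, averaged over all queries. A minimal sketch of that calculation follows, assuming brute-force inner product as the ground truth. The helper names and the tiny data set are hypothetical, and the approximate results are written by hand rather than produced by an HNSW search:

```python
import numpy as np

def exact_top_k(docs: np.ndarray, queries: np.ndarray, k: int) -> np.ndarray:
    """Brute-force top-k document ids per query, ranked by inner product."""
    scores = queries @ docs.T
    return np.argsort(-scores, axis=1, kind="stable")[:, :k]

def recall_at_k(approx_ids: np.ndarray, exact_ids: np.ndarray) -> float:
    """Share of exact neighbors that also appear in the approximate results."""
    hits = sum(len(set(a) & set(e)) for a, e in zip(approx_ids, exact_ids))
    return hits / exact_ids.size

docs = np.array([[2.0, 0.0], [0.0, 2.0], [1.0, 1.0], [-1.0, 0.0]])
queries = np.array([[1.0, 0.0], [0.0, 1.0]])

exact = exact_top_k(docs, queries, k=2)   # [[0, 2], [1, 2]]
approx = np.array([[0, 2], [1, 3]])       # hand-written stand-in for ANN output

recall = recall_at_k(approx, exact)       # 3 of 4 exact neighbors found: 0.75
```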
   
   I used luceneutil and set the following parameters:
   * single segment
   * maxConn: 48
   * beamWidthIndex: 200 
   * fanout: 200
   * topK: 10
   * metric: 'angular'
   
   I got a recall@10 of **99.1**:
   ```
   $ time python src/python/knnPerfTest.py
   WARNING: Gnuplot module not present; will not make charts
   lucene
   {'ndoc': (400000,), 'maxConn': (48,), 'beamWidthIndex': (200,), 'fanout': (200,), 'topK': (10,)}
   
   /home/ec2-user/candidate/lucene/lucene/core/build/libs/lucene-core-10.0.0-SNAPSHOT.jar:/home/ec2-user/candidate/lucene/lucene/sandbox/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/misc/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/facet/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/analysis/common/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/analysis/icu/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/queryparser/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/grouping/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/suggest/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/highlighter/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/codecs/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/queries/build/classes/java/main:/home/ec2-user/.gradle/caches/modules-2/files-2.1/com.carrotsearch/hppc/0.9.1/4bf4c51e06aec600894d841c4c004566b20dd357/hppc-0.9.1.jar:/home/ec2-user/candidate/luceneutil/lib/HdrHistogram.jar:/home/ec2-user/candidate/luceneutil/build:/home/ec2-user/candidate/luceneutil/src/main
   recall  latency nDoc    fanout  maxConn beamWidth       visited index ms
   ['java', '-cp', '/home/ec2-user/candidate/lucene/lucene/core/build/libs/lucene-core-10.0.0-SNAPSHOT.jar:/home/ec2-user/candidate/lucene/lucene/sandbox/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/misc/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/facet/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/analysis/common/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/analysis/icu/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/queryparser/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/grouping/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/suggest/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/highlighter/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/codecs/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/queries/build/classes/java/main:/home/ec2-user/.gradle/caches/modules-2/files-2.1/com.carrotsearch/hppc/0.9.1/4bf4c51e06aec600894d841c4c004566b20dd357/hppc-0.9.1.jar:/home/ec2-user/candidate/luceneutil/lib/HdrHistogram.jar:/home/ec2-user/candidate/luceneutil/build:/home/ec2-user/candidate/luceneutil/src/main', '--add-modules', 'jdk.incubator.vector', '-Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false', 'KnnGraphTester', '-ndoc', '400000', '-maxConn', '48', '-beamWidthIndex', '200', '-fanout', '200', '-topK', '10', '-dim', '768', '-docs', '/home/ec2-user/data-prep/wiki768.train', '-reindex', '-search', '/home/ec2-user/data-prep/wiki768.test', '-metric', 'angular', '-quiet']
   WARNING: Using incubator modules: jdk.incubator.vector
   
   0.991    6.98   400000  200     48      200     210     1913700 1.00    post-filter
   
   real    45m3.266s
   user    40m8.451s
   sys     4m54.290s
   ```
   
   <details>
    <summary> <h2>Dataset setup details</h2></summary>
     
   I pulled the data set as parquet files from https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings/tree/main/data:
   ```
   curl -LO https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings/resolve/main/data/train-00000-of-00004-1a1932c9ca1c7152.parquet
   curl -LO https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings/resolve/main/data/train-00001-of-00004-f4a4f5540ade14b4.parquet
   curl -LO https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings/resolve/main/data/train-00002-of-00004-ff770df3ab420d14.parquet
   curl -LO https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings/resolve/main/data/train-00003-of-00004-85b3dbbc960e92ec.parquet
   
   ```
   
   I ran the following to convert it into a format that luceneutil can read (requires `pip install numpy pyarrow`):
   ```
   import numpy as np
   import pyarrow.parquet as pq
   
   tb1 = pq.read_table("train-00000-of-00004-1a1932c9ca1c7152.parquet", columns=['emb'])
   tb2 = pq.read_table("train-00001-of-00004-f4a4f5540ade14b4.parquet", columns=['emb'])
   tb3 = pq.read_table("train-00002-of-00004-ff770df3ab420d14.parquet", columns=['emb'])
   tb4 = pq.read_table("train-00003-of-00004-85b3dbbc960e92ec.parquet", columns=['emb'])
   
   np1 = tb1[0].to_numpy()
   np2 = tb2[0].to_numpy()
   np3 = tb3[0].to_numpy()
   np4 = tb4[0].to_numpy()
   
   np_total = np.concatenate((np1, np2, np3, np4))
   
   # Have to convert to a list here to get 
   # the numpy ndarray's shape correct later
   # There's probably a better way...
   flat_ds = list()
   for vec in np_total:
        flat_ds.append(vec)
   
   np_flat_ds = np.array(flat_ds)
   
   # Shape is (485859, 768) and dtype is float32
   np_flat_ds
   
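   # Index vectors: the first 400,000 rows, written as raw float32 bytes
   with open("wiki768.train", "wb") as out_f:
        np_flat_ds[0:400000].tofile(out_f)
   
   # Query vectors: 10,000 rows, 475858 through 485857
   # (the slice stops just before the final row)
   with open("wiki768.test", "wb") as out_f:
        np_flat_ds[475858:-1].tofile(out_f)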
   ```
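
As a sanity check on the conversion, here is a minimal sketch that round-trips a small synthetic array through the same raw `tofile` layout and reads it back with `np.fromfile`. The tiny data, the temporary path, and the 4-dimensional vectors are stand-ins chosen for illustration, not the real data set. It also shows `np.stack` as one possible alternative to the list conversion, and how a `[start:-1]` slice differs from `[-n:]`:

```python
import os
import tempfile

import numpy as np

dim = 4  # stand-in for the real 768 dimensions

# Stand-in for the parquet column: an object array of 1-D float32 vectors
rows = np.empty(6, dtype=object)
for i in range(6):
    rows[i] = np.full(dim, i, dtype=np.float32)

# np.stack builds the 2-D float32 matrix directly from the object array
matrix = np.stack(rows)

# Write raw float32 values back to back, then read them back
path = os.path.join(tempfile.mkdtemp(), "vectors.bin")
with open(path, "wb") as out_f:
    matrix.tofile(out_f)

restored = np.fromfile(path, dtype=np.float32).reshape(-1, dim)

# A slice ending in -1 stops before the final row; [-3:] includes it
tail_excl = matrix[2:-1]  # rows 2, 3, 4
tail_incl = matrix[-3:]   # rows 3, 4, 5
```

The file carries no header, so the reader has to know the dimension up front, and the file size should equal `rows * dim * 4` bytes. On the real files, the same `np.fromfile(...).reshape(-1, 768)` call gives a quick way to confirm the row counts before indexing.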
   
   I then modified `knnPerfTest.py` to use this data set and the parameters above, and ran the test.
   
   </details>


