msokolov commented on PR #926:
URL: https://github.com/apache/lucene/pull/926#issuecomment-1164418508

   Hi Alessandro, thank you for running the tests. I'm suspicious of the 
results though -- they just look too good to be true! I know from profiling 
that we spend most of the time in similarity computations, yet this change 
doesn't impact how many of those we do nor how costly they are.
   
   One thing I see is that you are using an `hdf5` file as input, but this 
tester was not designed to accept that format. This is a script I have used to 
extract raw floating-point data (what KnnGraphTester expects) from hdf5. This 
also takes care of normalizing to unit vectors, which you should do for angular 
data, but not for euclidean data.
   
   ```
   import h5py
   import numpy as np
   import sys
   
   with h5py.File(sys.argv[1], 'r') as f:
       for key in f.keys():
           print(f"{key}: {f[key].shape}")
           ds = f[key]
           print(f"copying {ds.shape} from {key}")
           arr = np.zeros(ds.shape, dtype='float32')
           ds.read_direct(arr)
   
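           # normalize all vectors (along dim 1) to unit length;
           # do this for angular data only, skip it for euclidean data
           norm = np.linalg.norm(arr, 2, 1)
           # avoid dividing by zero for all-zero vectors
           norm[norm==0] = 1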
           arr = arr / np.expand_dims(norm, 1)
   
           arr.tofile(sys.argv[1] + "-" + key)
   ```
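
To check that a converted file really is in the raw layout described above, it can be read back and inspected. This is only a minimal sketch: the helper names `load_raw_vectors` and `is_unit_normalized` are made up for illustration, and it assumes the file is headerless float32 in native byte order (which is what `tofile` writes) and that you know the vector dimension.

```python
import numpy as np


def load_raw_vectors(path, dim):
    """Read a headerless float32 file and reshape it to (num_vectors, dim)."""
    data = np.fromfile(path, dtype=np.float32)
    if data.size % dim != 0:
        raise ValueError(
            f"{path}: {data.size} floats is not a multiple of dim={dim}"
        )
    return data.reshape(-1, dim)


def is_unit_normalized(vectors, tol=1e-4):
    """Return True if every row has L2 norm 1; all-zero rows are allowed."""
    norms = np.linalg.norm(vectors, axis=1)
    return bool(np.all((np.abs(norms - 1.0) < tol) | (norms == 0)))
```

For example, calling `load_raw_vectors(path, dim)` on one of the files written by the script and passing the result to `is_unit_normalized` should return `True` for angular data that went through the normalization step.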
    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

