mikemccand commented on issue #14758:
URL: https://github.com/apache/lucene/issues/14758#issuecomment-3253547064

   > * This proposal has an added side-benefit of de-duplicating vectors 
_within_ a field as well (if the features used for vector generation are 
identical across two documents)
   
   This is really interesting!  It reminds me of [ZFS's block-level 
deduping](https://www.truenas.com/docs/references/zfsdeduplication/), and how 
annoyingly and weirdly effective it often is (because the files I store have 
surprising redundancy / I forget that I made N copies / whatever).
   
   Hmmm must the deduping be perfect (guaranteed to dedup all dups)?  What if 
it only did so within one document, which would enable this "compile KNN 
prefilter to separate field's HNSW graph during indexing" efficiently?  But not 
across documents.  It might be a baby step with less added indexing cost since 
you wouldn't need a global sort order for vectors, merging wouldn't need to 
check for dups across segments, etc.?  Within one document we could even do a 
quick `==` check to see if the identical `float[]` or `byte[]` was passed to 
multiple fields (common case, hopefully)?
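
   A minimal sketch of the within-document `==` check, assuming the indexer sees every vector of one document before flushing it. The class and method names (`PerDocVectorDedup`, `add`, `storedCount`) are hypothetical and not part of Lucene; the sketch uses only the JDK's `IdentityHashMap`, which compares keys by reference rather than by content.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch (not Lucene API): dedup vectors within ONE document by reference identity. */
final class PerDocVectorDedup {
  /** Vectors actually stored for this document, in first-seen order. */
  private final List<float[]> stored = new ArrayList<>();
  /** Keys are compared with ==, so only the very same array instance is a hit. */
  private final Map<float[], Integer> slotByInstance = new IdentityHashMap<>();

  /** Returns the storage slot for this vector, reusing the slot if the same instance was already added. */
  int add(float[] vector) {
    Integer slot = slotByInstance.get(vector);
    if (slot == null) {
      slot = stored.size();
      stored.add(vector);
      slotByInstance.put(vector, slot);
    }
    return slot;
  }

  /** Number of distinct array instances stored for this document. */
  int storedCount() {
    return stored.size();
  }

  public static void main(String[] args) {
    float[] shared = {0.1f, 0.2f, 0.3f};
    float[] equalCopy = shared.clone();
    PerDocVectorDedup doc = new PerDocVectorDedup();
    System.out.println(doc.add(shared));    // fieldA: first time seen
    System.out.println(doc.add(shared));    // fieldB: same instance, slot reused
    System.out.println(doc.add(equalCopy)); // equal contents, different instance: new slot
    System.out.println(doc.storedCount());
  }
}
```

   The identity check is cheap but imperfect by design: a copy with equal contents is stored again. Catching those too would need a content comparison such as `Arrays.equals`, at extra indexing cost.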
   
   > Multiple ordinals can point to the same vector, so we need to maintain an 
additional mapping of ordinal -> position of vector in raw data.
   
   Hmm, why would we need to also know the position?  Oh I see -- `fieldA` will 
refer to the dedup'd vector with a different ordinal than `fieldB` because 
ordinals must be compact (0..N) within each field?  This might be another 
(small?) benefit of only deduping within one document: this ordinalA -> ordinalB 
mapping would be monotonic, and compress well 
(`PackedLongValues.monotonicBuilder`), like the `DocMap` we use during merging 
to map around deletions, or map after merge sort (for the statically sorted 
indexes).
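
   To illustrate why the mapping is monotonic, here is a rough sketch under two assumptions: both fields assign ordinals in docID order, and `fieldB` only ever reuses `fieldA`'s vector from the same document. `LinkedOrdMap` and its methods are hypothetical names. A plain `int[]` stands in for the compressed structure to keep the example dependency-free; in Lucene the same values could be fed to `PackedLongValues.monotonicBuilder`.

```java
import java.util.Arrays;

/** Hypothetical sketch (not Lucene API): maps fieldB ordinals onto fieldA ordinals. */
final class LinkedOrdMap {
  /** bToA[ordB] = ordA; strictly increasing because both ordinals grow with docID. */
  private final int[] bToA;

  /** hasA[doc] / hasB[doc] say whether the doc has a vector in fieldA / a linked vector in fieldB. */
  LinkedOrdMap(boolean[] hasA, boolean[] hasB) {
    if (hasA.length != hasB.length) {
      throw new IllegalArgumentException("per-doc arrays must have the same length");
    }
    int[] tmp = new int[hasB.length];
    int ordA = 0;
    int ordB = 0;
    for (int doc = 0; doc < hasA.length; doc++) {
      if (hasB[doc]) {
        if (!hasA[doc]) {
          throw new IllegalArgumentException("doc " + doc + " links to a missing fieldA vector");
        }
        tmp[ordB++] = ordA; // fieldB's next ordinal points at fieldA's ordinal for this doc
      }
      if (hasA[doc]) {
        ordA++;
      }
    }
    bToA = Arrays.copyOf(tmp, ordB);
  }

  /** Forward lookup: fieldB ordinal to fieldA ordinal. */
  int toA(int ordB) {
    return bToA[ordB];
  }

  /** Reverse lookup by binary search, valid only because bToA is sorted; -1 if not linked. */
  int toB(int ordA) {
    int idx = Arrays.binarySearch(bToA, ordA);
    return idx >= 0 ? idx : -1;
  }

  public static void main(String[] args) {
    boolean[] hasA = {true, true, false, true, true};
    boolean[] hasB = {false, true, false, false, true};
    LinkedOrdMap map = new LinkedOrdMap(hasA, hasB);
    System.out.println(map.toA(0) + " " + map.toA(1)); // prints "1 3"
    System.out.println(map.toB(3) + " " + map.toB(2)); // prints "1 -1"
  }
}
```

   Because the array is sorted, the reverse direction needs no second map. Cross-document dedup would lose this property, since a later document could point back at an arbitrarily early ordinal.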
   
   Maybe this could be done with an "ordinal to ordinal" mapping, just within the 
dup'd field?  The user would add `fieldA = new KnnFloatVectorField(...)` and then 
`fieldB = fieldA.newLinkedField("name")`, so `fieldB` knows it's sharing 
`fieldA`'s vector.  `fieldA` would build a normal HNSW graph like we have today 
(no ord remapping, positions, etc.), but `fieldB` would record an additional 
ord -> ord map (maybe a bimap?).
   
   > We'll need some codec-level plumbing to ensure that the same instance (not 
just class!) of the raw format is shared by all HNSW formats
   
   Isn't this how the default Codec (currently `Lucene103Codec`) works?  It uses 
a per-field HNSW format, but by default all fields share the same instance of 
`KnnVectorsWriter` when writing a segment?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


