mikemccand commented on issue #14758: URL: https://github.com/apache/lucene/issues/14758#issuecomment-3253547064
> * This proposal has an added side-benefit of de-duplicating vectors _within_ a field as well (if the features used for vector generation are identical across two documents)

This is really interesting! It reminds me of [ZFS's block-level deduping](https://www.truenas.com/docs/references/zfsdeduplication/), and how annoyingly and weirdly effective it often is (because the files I store have surprising redundancy / I forget that I made N copies / whatever).

Hmmm, must the deduping be perfect (guaranteed to dedup all dups)? What if it only did so within one document, which would enable this "compile KNN prefilter to separate field's HNSW graph during indexing" efficiently? But not across documents. It might be a baby step with less added indexing cost, since you wouldn't need a global sort order for vectors, merging wouldn't need to check for dups across segments, etc.?

Within one document we could even do a quick `==` check to see if the identical `float[]` or `byte[]` was passed to multiple fields (common case, hopefully)?

> Multiple ordinals can point to the same vector, so we need to maintain an additional mapping of ordinal -> position of vector in raw data.

Hmm, why would we need to also know the position? Oh I see -- `fieldA` will refer to the dedup'd vector with a different ordinal than `fieldB` because ordinals must be compact (0..N) within each field?

This might be another (small?) benefit of only deduping within one document: this ordinalA -> ordinalB mapping would be monotonic, and compress well (`PackedLongValues.monotonicBuilder`), like the `DocMap` we use during merging to map around deletions, or map after merge sort (for the statically sorted indexes).

Maybe this could be done with an "ordinal to ordinal" mapping, just within the dup'd field? The user would add `fieldA = new KnnFloatVectorField(...)` and then `fieldB = fieldA.newLinkedField("name")`, so `fieldB` knows it's sharing `fieldA`'s vector.
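
A rough sketch of the within-document case, to make the monotonic-map point concrete. This is plain Java with no Lucene classes, and every name in it (`LinkedOrdSketch`, `Doc`, `buildOrdBToOrdA`) is made up for illustration. It assumes documents arrive in docid order and that sharing is detected purely by reference equality; a real implementation would presumably pack the result with something like `PackedLongValues.monotonicBuilder` rather than a `long[]`:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not Lucene code: all names here are invented for illustration.
public class LinkedOrdSketch {

  /** One document's vectors; either field may be absent (null). */
  public static final class Doc {
    final float[] fieldA;
    final float[] fieldB;

    public Doc(float[] fieldA, float[] fieldB) {
      this.fieldA = fieldA;
      this.fieldB = fieldB;
    }
  }

  /**
   * Walks documents in docid order and returns, for each fieldB ordinal, the fieldA
   * ordinal whose vector it shares, or -1 if fieldB was given a different array
   * (in which case its vector would have to be stored separately).
   */
  public static long[] buildOrdBToOrdA(List<Doc> docs) {
    List<Long> map = new ArrayList<>();
    long ordA = 0; // next compact ordinal for fieldA
    for (Doc doc : docs) {
      if (doc.fieldB != null) {
        // Reference equality only: cheap, and catches the same array passed to both fields.
        boolean shared = doc.fieldA != null && doc.fieldB == doc.fieldA;
        map.add(shared ? ordA : -1L);
      }
      if (doc.fieldA != null) {
        ordA++;
      }
    }
    return map.stream().mapToLong(Long::longValue).toArray();
  }

  public static void main(String[] args) {
    float[] v0 = {1f, 0f};
    float[] v2 = {0f, 1f};
    List<Doc> docs = Arrays.asList(
        new Doc(v0, v0),                     // doc 0: both fields share v0
        new Doc(new float[] {1f, 1f}, null), // doc 1: fieldA only
        new Doc(v2, v2));                    // doc 2: both fields share v2
    System.out.println(Arrays.toString(buildOrdBToOrdA(docs))); // prints [0, 2]
  }
}
```

Because both ordinal spaces only ever grow with docid, the shared entries come out non-decreasing, which is the property that would make the map cheap to store.
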
`fieldA` would build a normal HNSW graph like we have today (no ord remapping, positions, etc.), but `fieldB` would record an additional ord -> ord map (maybe bimap?).

> We'll need some codec-level plumbing to ensure that the same instance (not just class!) of the raw format is shared by all HNSW formats

Isn't this how the default Codec (currently `Lucene103Codec`) works? It's per-field HNSW format, but by default all fields use the same instance of `KnnVectorsWriter` when writing a segment?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org