mikemccand commented on issue #13158:
URL: https://github.com/apache/lucene/issues/13158#issuecomment-3820323014

   > As read-only indexes are mostly optimized to one segment anyways, maybe 
add a MergePolicy that can do this somehow when doing a forceMerge?
   
   I think read-only isn't the right word -- you're thinking of the kind of 
indices you would burn & ship on CD so people could search the docs of 
installed software, right? :)  Like a "dead end" index that is never changed 
again.
   
   For this feature, the index is still writable / actively changing, but when 
you replicate it, the replica becomes read-only (replicas will never be 
promoted to primary and suddenly start writing into their copy).
   
   Maybe "read-only mirror/replica" is a better term?  Each new segment gets 
replicated to the read-only replica during NRT segment replication.
   
   So this lossy process (discarding full precision vectors) is only done once 
per segment, during replication, and `IndexWriter` never opens these replicated 
segments (only `IndexReader` does).
   
   Our hacky implementation today (during replication, when we see a `.vec/q` 
file, we also write another file with 0 vectors instead) is quite messy: we are 
stuffing the file into the index w/o Lucene's knowledge (so we cannot use 
Lucene's `incRef`/`decRef` lifecycle to delete these files on time), we are 
recreating Codec-specific code (a bad abstraction violation, since that's an 
impl detail of the Codec), and we have hacks to copy over the right segment 
GUID, match the Codec header/footer, write the checksum, etc. -- badly copying 
all the stuff the Codec is already doing.
   
   So the idea we iterated to here is to instead have the KNN quantizer (the 
quantizing `FlatVectorsWriter`) also write another file, an empty `.veq`.  You 
would only use this empty `.veq` file if you want a "read-only replica"; then 
you would replicate this empty file instead of the full file.  Since the Codec 
then owns this file, it's properly tracked in `.files()`, deleted when the 
segment is deleted, etc.
   
   I *think* it should be a tiny change (compared to the internal hack we have 
so far...), but I haven't looked at #15630 yet to see...
   
   This design is not perfect -- most Lucene users won't use these empty `.veq` 
files (though they are tiny, and never opened unless you make a read-only 
replica).
   
   But the impact is _massive_ for our (Amazon's Customer product search) usage 
-- 5X reduction in index size / replication bandwidth / storage / time for 
vector-heavy indices (because for `int8` we are storing both `float32` and 
`int8`, i.e. 4 + 1 = five bytes per dimension, and with this fix it's one byte).
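   
   To make the 5X figure concrete, here is a rough back-of-the-envelope sketch. 
The class and method names are made up for illustration (nothing here is a 
Lucene API), and it only counts the raw per-dimension vector bytes, ignoring 
any per-vector or per-file overhead (headers, footers, graph files, etc.):

```java
// Hypothetical helper, not part of Lucene: estimates raw vector bytes only.
public final class VectorStorageEstimate {
    static final int FLOAT32_BYTES = Float.BYTES; // 4 bytes per dimension
    static final int INT8_BYTES = Byte.BYTES;     // 1 byte per dimension

    // Primary (and today's replica): full-precision plus quantized copy.
    static long primaryBytes(long numVectors, int dims) {
        return numVectors * dims * (FLOAT32_BYTES + INT8_BYTES);
    }

    // Read-only replica that drops the full-precision vectors.
    static long replicaBytes(long numVectors, int dims) {
        return numVectors * dims * INT8_BYTES;
    }

    // Ratio of the two; independent of vector count and dimension.
    static double reductionFactor() {
        return (double) (FLOAT32_BYTES + INT8_BYTES) / INT8_BYTES;
    }

    public static void main(String[] args) {
        long numVectors = 1_000_000L; // assumed example size
        int dims = 768;               // assumed example dimension
        System.out.println("primary bytes: " + primaryBytes(numVectors, dims));
        System.out.println("replica bytes: " + replicaBytes(numVectors, dims));
        System.out.println("reduction:     " + reductionFactor() + "x");
    }
}
```

   The ratio does not depend on the number of vectors or their dimension, 
which is why the saving applies to any index dominated by vector data.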
   
   So I hope we can find a clean / simple way to bring this option to all 
Lucene users using NRT segment replication ...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

