mikemccand commented on issue #13158: URL: https://github.com/apache/lucene/issues/13158#issuecomment-3820323014
> As read-only indexes are mostly optimized to one segment anyways, maybe add a MergePolicy that can do this somehow when doing a forceMerge?

I think read-only isn't the right word -- you're thinking of the kind of indices you would burn & ship on CD so people could search the docs of installed software, right? :) Like a "dead end index, never again changed".

For this feature, the index is still writable / actively changing, but when you replicate it, the replica becomes read-only (replicas will never be promoted to primary and suddenly start writing into their copy). Maybe "read-only mirror/replica" is a better term?

Each new segment gets replicated to the read-only replica during NRT segment replication. So this lossy process (discarding full precision vectors) is done only once per segment, during replication, and `IndexWriter` never opens these replicated segments (only `IndexReader` does).

Our hacky implementation today (during replication, when we see a `.vec/q` file, we also write another file with 0 vectors instead) is quite messy because:

- we are stuffing it into the index w/o Lucene's knowledge (we cannot use Lucene's `incRef`/`decRef` for lifecycle, to delete the files on time),
- we are recreating Codec-specific code (a bad abstraction violation, since that's an impl detail of the Codec),
- we have hacks to copy over the right segment GUID, match the Codec header/footer, write the checksum, etc. -- badly copying all the stuff the Codec is already doing.

So the idea we iterated to here is to instead expect the KNN quantizer (the quantizing `FlatVectorsWriter`) to also write another file, an empty `.veq`. You would only use this empty `.veq` file if you want a "read-only replica"; then you would replicate this empty file instead of the full file. Since the Codec then owns this file, it's properly tracked in `.files()`, deleted when the segment is deleted, etc.

I *think* it should be a tiny change (compared to the internal hack we have so far...), but I haven't looked at #15630 yet to see...
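To make the replication side of this idea concrete, here is a rough sketch in plain Java with no Lucene APIs. Everything in it is hypothetical: `ReplicaFileSelector`, `filesToCopy`, and the map from a full file to its empty companion are illustration-only names, and how the empty file would actually be named or discovered is not decided here. It only shows the selection step: copy every segment file as usual, except that a full-precision file is swapped for its Codec-written empty companion.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ReplicaFileSelector {

    /**
     * Hypothetical helper, not a Lucene API.
     *
     * @param segmentFiles     all file names the segment reports
     * @param emptyCompanionOf assumed mapping: full-precision file name -> empty companion file name
     * @return the file names a read-only replica would copy
     */
    static List<String> filesToCopy(List<String> segmentFiles,
                                    Map<String, String> emptyCompanionOf) {
        List<String> toCopy = new ArrayList<>();
        for (String file : segmentFiles) {
            // The companion is emitted in place of its full file below, so skip it here.
            if (emptyCompanionOf.containsValue(file)) {
                continue;
            }
            // Swap the full-precision file for its empty companion; keep all other files.
            toCopy.add(emptyCompanionOf.getOrDefault(file, file));
        }
        return toCopy;
    }
}
```

The point of the sketch is that the replicator only chooses between files the Codec already wrote and tracks, instead of fabricating a file itself.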
This design is not perfect -- most Lucene users won't use these empty `.veq` files (though they are tiny, and never opened unless you make a read-only replica). But the impact is _massive_ for our (Amazon's customer product search) usage -- a 5X reduction in index size / replication bandwidth / storage / time for vector-heavy indices (because for `int8` we are storing both `float32` and `int8`, i.e. five bytes per dimension, and with this fix it's one byte). So I hope we can find a clean / simple way to bring this option to all Lucene users using NRT segment replication ...
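The 5X figure follows directly from the per-dimension byte counts above. A minimal back-of-the-envelope sketch, assuming only the raw vector payload matters (it ignores any per-vector quantization metadata, the HNSW graph, and all non-vector index data); the class name, method name, and the 768-dimension example are hypothetical:

```java
public class VectorStorageEstimate {

    /**
     * Raw vector payload per document, in bytes.
     * Assumes int8 quantization: 1 byte per dimension for the quantized copy,
     * plus 4 bytes per dimension when the float32 original is also kept.
     */
    static long bytesPerVector(int dims, boolean keepFloat32) {
        long int8Bytes = dims;                              // quantized copy
        long float32Bytes = keepFloat32 ? 4L * dims : 0L;   // full-precision copy
        return int8Bytes + float32Bytes;
    }

    public static void main(String[] args) {
        int dims = 768; // assumed dimensionality, for illustration only
        long primary = bytesPerVector(dims, true);   // float32 + int8
        long replica = bytesPerVector(dims, false);  // int8 only
        System.out.println("primary bytes/vector: " + primary);
        System.out.println("replica bytes/vector: " + replica);
        System.out.println("reduction factor: " + (double) primary / replica);
    }
}
```

The ratio is independent of the dimension count, since both copies scale linearly with it; the real-world factor will be somewhat lower once non-vector index data is included.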
