Hi Lucene friends,

We use the replicator module to implement log-shipping replication for our Lucene cluster. We have an offline "rebuild everything" process for use when indexing or data formats change.
We have a single primary node that only serves the IndexWriter and the replicator API, while the replicas handle user queries. The offline rebuild produces a new index, which we then "atomic swap" in over the primary's data (taking care to preserve the generation counter) by restarting the primary node.

However, although the replicas notice that the generation counter has incremented, they refuse to accept updates from the primary because the new NRT version is lower than the one they already hold. Example:

211270.116s 181.3s: syncing R1726737503 [search-commit-0] top: commit primaryGen=153 infos=segments_2lz: _1oml(10.1.0):C3859289/782993:[...
211270.162s 181.4s: syncing R1726737503 [search-commit-0] top: commit decRef lastCommitFiles=[_24og_Lucene101_0.tmd, ..., _24og.fnm]
211270.162s 181.4s: syncing R1726737503 [search-commit-0] now delete 1 files: [segments_2lz]
211270.163s 181.4s: syncing R1726737503 [search-commit-0] top: commit version=308804 files now [_24og_Lucene101_0.tmd, ..., _24og.fnm]
211275.387s 186.6s: syncing R1726737503 [index-update-0] top: start sync sis.version=241
211275.386s 186.6s: syncing R1726737503 [index-update-0] top: delete if no ref pendingMergeFiles=[]
211275.386s 186.6s: syncing R1726737503 [index-update-0] top: now change lastPrimaryGen from 153 to 154 pendingMergeFiles=[]
211275.387s 186.6s: syncing R1726737503 [index-update-0] top: new NRT point (version=241) is older than current (version=308804); skipping

You can see the old generation, 153, consider its final segments_2lz. Then the new generation, 154, comes online with a reset NRT counter (241). Since this is less than the old NRT counter, 308804, the replica never updates until we push about 300k updates through our pipeline; at that point it "snaps" back into place and starts working.

If a new generation of indexer comes up, should the replica forget its NRT counter? (Maybe this is a safety mechanism to avoid replacing newer data with older data from a stale replica?)
Or is there some other mechanism we are missing to reset this counter? We know the data we load is discontinuous with the old history, and we want it to replace all current data (and delete the old data). The best I've come up with so far is to detect this situation in maybeNewPrimary and fake out getCurrentSearchingVersion, but that feels hacky at best.

Thanks for any advice here!

Steven
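
P.S. To make the question concrete, here is a minimal sketch of the two behaviors, written as a standalone model rather than against Lucene's code. The class and method names (NrtPointModel, ReplicaState, acceptsObserved, acceptsWithGenReset) are hypothetical, and the strict "newer than" comparison is an assumption inferred from the log message above. The numbers are the ones from our log.

```java
/**
 * Hypothetical, simplified model of a replica deciding whether to accept
 * an incoming NRT point. None of these names come from Lucene itself.
 */
public class NrtPointModel {

    /** What the replica remembers: last primary generation seen and the version it is searching. */
    public static final class ReplicaState {
        public final long lastPrimaryGen;
        public final long currentVersion;

        public ReplicaState(long lastPrimaryGen, long currentVersion) {
            this.lastPrimaryGen = lastPrimaryGen;
            this.currentVersion = currentVersion;
        }
    }

    /**
     * Behavior we appear to observe: only the version is compared, so a point
     * from a new primary generation is still skipped if its version is lower.
     * (Assumption: the comparison is strict.)
     */
    public static boolean acceptsObserved(ReplicaState state, long primaryGen, long version) {
        return version > state.currentVersion;
    }

    /**
     * Behavior we are asking about: a newer primary generation marks a
     * discontinuity, so the replica's remembered version is ignored.
     */
    public static boolean acceptsWithGenReset(ReplicaState state, long primaryGen, long version) {
        if (primaryGen > state.lastPrimaryGen) {
            return true; // new generation: accept regardless of version
        }
        return version > state.currentVersion;
    }

    public static void main(String[] args) {
        // Replica last saw primaryGen=153 and is searching version=308804.
        ReplicaState replica = new ReplicaState(153, 308804);

        // New primary comes up as primaryGen=154 with version=241.
        System.out.println("observed:       " + acceptsObserved(replica, 154, 241));
        System.out.println("with gen reset: " + acceptsWithGenReset(replica, 154, 241));
    }
}
```

Under the first rule the gen-154 point is rejected until its version passes 308804, which matches what we see. Under the second it is accepted immediately, but a stale primary that restarts with a higher generation would be accepted too, which may be exactly the case the current check guards against.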