lucene-replicator: how to correctly reset NRT version

Steven Schlansker Fri, 21 Feb 2025 13:11:38 -0800

Hi Lucene friends,

We use the replicator module to implement log-shipping replication for our 
Lucene cluster.
We have an offline "rebuild everything" process for use when indexing or data 
formats change.


We have a single primary node that only serves the IndexWriter and the 
replicator API, while the replicas handle user queries.
This offline rebuild produces a new index which we then "atomic swap" in over 
the primary data (taking care to preserve the generation counter) by restarting 
the primary node.

However, although the replicas notice that the generation counter incremented, 
they refuse to accept updates from the primary because its NRT version is lower 
than the one they already hold.

Example:
211270.116s 181.3s: syncing R1726737503 [search-commit-0] top: commit primaryGen=153 infos=segments_2lz: _1oml(10.1.0):C3859289/782993:[...
211270.162s 181.4s: syncing R1726737503 [search-commit-0] top: commit decRef lastCommitFiles=[_24og_Lucene101_0.tmd, ..., _24og.fnm]
211270.162s 181.4s: syncing R1726737503 [search-commit-0] now delete 1 files: [segments_2lz]
211270.163s 181.4s: syncing R1726737503 [search-commit-0] top: commit version=308804 files now [_24og_Lucene101_0.tmd, ..., _24og.fnm]
211275.387s 186.6s: syncing R1726737503 [index-update-0] top: start sync sis.version=241
211275.386s 186.6s: syncing R1726737503 [index-update-0] top: delete if no ref pendingMergeFiles=[]
211275.386s 186.6s: syncing R1726737503 [index-update-0] top: now change lastPrimaryGen from 153 to 154 pendingMergeFiles=[]
211275.387s 186.6s: syncing R1726737503 [index-update-0] top: new NRT point (version=241) is older than current (version=308804); skipping

You can see the old generation, 153, handle its final commit (segments_2lz).
Then the new generation, 154, comes online with a reset NRT version (241).
Since this is lower than the old NRT version, 308804, the replica never updates
until we push roughly 300k updates through our pipeline; then it "snaps" back 
into place and starts working.
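
To make the skip concrete, here is a minimal stand-alone sketch of the 
behaviour the log suggests. It is a model under stated assumptions, not 
Lucene's actual ReplicaNode code: the class NrtVersionGate and its methods are 
hypothetical names, and how Lucene treats an equal version is not visible in 
the log, so the model only skips strictly older versions.

```java
// Stand-alone model of the skip seen in the log. All names here are
// hypothetical; this is not Lucene's ReplicaNode implementation.
final class NrtVersionGate {
    private long lastPrimaryGen;
    private long currentVersion;

    NrtVersionGate(long lastPrimaryGen, long currentVersion) {
        this.lastPrimaryGen = lastPrimaryGen;
        this.currentVersion = currentVersion;
    }

    /** Returns true if the offered NRT point would be applied, false if skipped. */
    boolean offer(long primaryGen, long version) {
        // The generation is recorded ("now change lastPrimaryGen from 153 to 154") ...
        lastPrimaryGen = primaryGen;
        // ... but the version baseline survives the generation change.
        if (version < currentVersion) {
            return false; // "new NRT point ... is older than current ...; skipping"
        }
        currentVersion = version;
        return true;
    }

    long lastPrimaryGen() { return lastPrimaryGen; }

    long currentVersion() { return currentVersion; }

    public static void main(String[] args) {
        NrtVersionGate replica = new NrtVersionGate(153, 308804);
        System.out.println(replica.offer(154, 241));    // false: skipped
        System.out.println(replica.offer(154, 308805)); // true: applied
    }
}
```

In this model the replica only accepts the new primary once its version 
exceeds the old baseline, which matches the "snap back" after about 300k 
updates.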

If a new generation of the indexer comes up, should the replica forget its NRT 
version? (Maybe this check is a safety mechanism to avoid replacing newer data 
with older data from a stale replica?)

Or is there some other mechanism we are missing to reset this counter? We know 
that the data we load is discontinuous with the old history, and we want it to 
replace all current data (and delete the old data).

The best I've come up with so far is to detect this situation in 
maybeNewPrimary and fake out getCurrentSearchingVersion, but that feels hacky 
at best.
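
The idea behind that workaround can be sketched as a stand-alone model: when a 
higher primary generation shows up, forget the version baseline so the next 
NRT point is accepted. The class GenAwareVersionGate and its methods are 
hypothetical, and nothing here is wired into Lucene; whether maybeNewPrimary 
and getCurrentSearchingVersion can actually be overridden like this depends on 
the Lucene version and is an assumption.

```java
// Stand-alone model of "reset the baseline on a new primary generation".
// Hypothetical names; not a patch to Lucene's ReplicaNode.
final class GenAwareVersionGate {
    private long lastPrimaryGen;
    private long currentVersion;

    GenAwareVersionGate(long lastPrimaryGen, long currentVersion) {
        this.lastPrimaryGen = lastPrimaryGen;
        this.currentVersion = currentVersion;
    }

    /** Returns true if the offered NRT point would be applied, false if skipped. */
    boolean offer(long primaryGen, long version) {
        if (primaryGen < lastPrimaryGen) {
            return false; // ignore a primary from an older generation
        }
        if (primaryGen > lastPrimaryGen) {
            // New primary generation: its history is discontinuous with ours,
            // so the old version baseline no longer applies.
            lastPrimaryGen = primaryGen;
            currentVersion = -1;
        }
        if (version < currentVersion) {
            return false; // within one generation, versions must not go backwards
        }
        currentVersion = version;
        return true;
    }

    long currentVersion() { return currentVersion; }

    public static void main(String[] args) {
        GenAwareVersionGate replica = new GenAwareVersionGate(153, 308804);
        System.out.println(replica.offer(154, 241)); // true: accepted after the reset
        System.out.println(replica.offer(154, 240)); // false: older within generation 154
    }
}
```

This trades away the version check across generations, so it only makes sense 
if the generation counter is guaranteed to increase exactly when the index is 
rebuilt.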

Thanks for any advice here!
Steven


