xrobau opened a new issue, #4621: URL: https://github.com/apache/couchdb/issues/4621
## Description

See PR #3901, which changed `instance_start_time` from a fixed zero to the correct utime, and is mentioned in [the 3.3 changelog](https://docs.couchdb.org/en/stable/whatsnew/3.3.html). Unfortunately, this missed the case where client replicators see the value change from zero to a non-zero number.

In our environment, we have a core couch cluster of 5 nodes, currently running 3.2.2-2, with a frontend haproxy that distributes requests. After we upgraded our first node, we started getting sporadic crashing alerts from clients, with the error:

````
"state": "crashing",
"info": {
    "error": "instance_start_time on source database has changed since last checkpoint."
},
````

This is a legitimate error. The previous checkpoint had zero, the current checkpoint had the correct timestamp, so the replicator correctly assumed that something had broken and refused to continue.

This ALSO led to the extremely unpleasant situation where a cluster of machines was returning different data for the same endpoint, depending on the version of couch that served the request:

````
root@mke2-fs1a:~# curl -s http://admin:admin@10.60.1.63:5984/bigcalls | jq | grep start
    "instance_start_time": "0"
root@mke2-fs1a:~# curl -s http://admin:admin@10.60.1.62:5984/bigcalls | jq | grep start
    "instance_start_time": "1590120633",
root@mke2-fs1a:~#
````

## Suggestion

As this is going to need the CLIENT checkpoint to be updated when the start time changes from zero to non-zero, without triggering a full resync, my gut feeling is that the upgrade process would have to be something like this:

1. The Couch team adds code to 3.3 that treats a zero-to-non-zero update as a normal and expected thing that should not cause an error (a rough sketch of this tolerance is at the end of this issue).
2. The Couch team adds a `report_start_time_as_zero` flag somewhere, so that all the nodes will continue to report zero, even when upgraded to 3.3.
3. We then add that flag to our ansible playbook, deploy it, and upgrade all the nodes to 3.3. At this point we'd be running 3.3.3 (or whatever the version would be), but everything would still be returning zero as the start time (the second sketch below shows one way to verify that every node really is reporting the same value).
4. We remove the flag from the core cluster, which would then update the checkpoints on the first level of clients (without triggering a resync).
5. We remove the flag from every other node, which would then let the 0->int change propagate through all the client levels.

This is probably a bit of an edge case specific to us, as our `bigcalls` database has several hundred million records, is replicated globally, and does take about a week to bootstrap - so redoing all our replications is not something we look forward to!
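To make step 1 concrete, here is a minimal sketch of the tolerance I have in mind for the replicator's checkpoint comparison. This is illustrative Python, not CouchDB's actual Erlang replicator code; the function names and checkpoint structure are hypothetical.

````python
# Hypothetical sketch of a zero-tolerant checkpoint check (step 1).
# Not CouchDB's real code; names and structure are illustrative only.

def start_time_compatible(checkpointed: str, current: str) -> bool:
    """Decide whether a replication checkpoint is still usable.

    A checkpointed value of "0" means the source was a pre-3.3 node
    (or a 3.3+ node with the proposed report_start_time_as_zero flag
    set), so any current value is acceptable: the 0 -> non-zero
    transition is an upgrade, not a database rebuild.
    """
    if checkpointed == "0":
        return True  # legacy zero: accept, and rewrite the checkpoint
    return checkpointed == current


def validate_checkpoint(checkpoint: dict, source_info: dict) -> dict:
    """Validate and, if needed, upgrade the stored checkpoint."""
    old = checkpoint.get("instance_start_time", "0")
    new = source_info["instance_start_time"]
    if not start_time_compatible(old, new):
        raise RuntimeError(
            "instance_start_time on source database has changed "
            "since last checkpoint."
        )
    # Persist the real start time so later comparisons stay strict.
    checkpoint["instance_start_time"] = new
    return checkpoint
````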
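And as a rollout aid for steps 3-5, something along these lines could confirm that every node behind haproxy reports the same `instance_start_time` before and after the flag is removed. The node addresses, credentials, and database name are just the example values from above.

````python
# Poll each cluster node directly (bypassing haproxy) and print the
# instance_start_time it reports for a database. Hosts, credentials
# and database name are example values from this report.
import base64
import json
import urllib.request

NODES = ["10.60.1.62", "10.60.1.63"]  # core cluster nodes (example IPs)
DB = "bigcalls"
AUTH = base64.b64encode(b"admin:admin").decode()

def get_start_time(host: str) -> str:
    req = urllib.request.Request(f"http://{host}:5984/{DB}")
    req.add_header("Authorization", f"Basic {AUTH}")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["instance_start_time"]

times = {host: get_start_time(host) for host in NODES}
for host, start_time in times.items():
    print(f"{host}: instance_start_time = {start_time}")

if len(set(times.values())) > 1:
    print("WARNING: nodes disagree; replication clients may crash on "
          "their next checkpoint")
````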