I've been trying to port synapse [1] to use the server state file for seamless reloads, but I'm having some trouble. The state file seems to be essentially ignored, because any change in a backend's server ordering invalidates the state for the entire backend: the server puids change even if the server id (name) stays constant. Synapse shuffles a backend's servers on every write of the configuration so that different client machines get different starting servers (e.g. for long-lived connections), which naturally changes the puids. But even with a fixed order such as sorting, whenever a server is added or removed the puids shift and potentially none of the state is transferred across the reload.
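To make the shifting concrete, here is a rough sketch in Python. It is only a model of the behaviour as I understand it, not HAProxy code: it assumes puids are handed out as 1-based positions in configuration order and that state is carried over only when both the name and the puid match. The function names and addresses are made up for illustration.

```python
def assign_puids(server_names):
    """Model assumption: puid is the 1-based position in config order."""
    return {name: pos for pos, name in enumerate(server_names, start=1)}


def surviving_state(old_names, new_names):
    """Names whose (name, puid) pair is identical before and after a reload.

    Model assumption: state is only applied when both fields match.
    """
    old = assign_puids(old_names)
    new = assign_puids(new_names)
    return [name for name in new_names if old.get(name) == new[name]]


# Hypothetical backend with names set to addr:port identifiers.
before = ["10.0.0.1:80", "10.0.0.2:80", "10.0.0.3:80"]

# Removing the first server shifts every remaining puid down by one.
after_removal = ["10.0.0.2:80", "10.0.0.3:80"]

# Same servers, different order (as a shuffle might produce).
reordered = ["10.0.0.3:80", "10.0.0.1:80", "10.0.0.2:80"]

print(surviving_state(before, after_removal))  # []
print(surviving_state(before, reordered))      # []
```

In both cases every server name is still present after the reload, yet under this model no server keeps its state, because no (name, puid) pair survives.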

I guess this comes back to the id nomenclature. From what I can tell, the server struct defines two id-like fields, id and puid, which most of the server.c code refers to as name and id respectively (name = id, id = puid). Somewhat confusingly, what the code calls the id is really the puid, which is not actually unique; it's just the order of the servers in a backend, and I assume it exists because ids (names) might be duplicated.

If we set the server id (name) to a proper identifier (e.g. a unique addr+port+user-supplied string), then the apply server state function ignores the state whenever the server count changes, because the puids don't match. If we set the server id (name) to a constant set of identifiers (e.g. srv1-srvN), then the apply server state function will apply things like healthcheck state to a totally unrelated server, which also seems bad. Either way, the server state doesn't work as I would hope.

I sort of expect the server id (name) to actually be a unique identifier, but that gets us back to the question of how to dynamically update it. I think you suggested adding another identifier, the external identifier, but now that I know there is already a puid confusingly called an id, and an id called the name, I'm a little concerned about adding a third identifier.

What do you guys think? When using show servers state, is it best practice to set your server names to:

- actual unique identifiers, in which case the puid check in the apply server state function should probably go away, right?
- positional identifiers, in which case how do we prevent carry-over of healthcheck state between unrelated servers? Maybe we should disable state loading if we know the servers have moved around.

Do we need a third id?

Thanks!
-Joey

[1] https://github.com/airbnb/synapse

