The second thing I want to look at is replacing queued state update operations with local CAS loops for state format v2 collections, with an in-process, collection-level mutex to ensure that a node isn't contending with itself. This would only be for state updates; anything more complex would still go to Overseer.
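To make that concrete, here's a minimal sketch of the kind of per-collection CAS loop I mean, using plain ZK getData/setData with a version check. The class name and the byte[] mutation callback are just illustrative, and the state.json path is the assumed state format v2 layout; the real thing would edit the collection state through the existing cluster-state code rather than raw bytes:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.UnaryOperator;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

/** Illustrative only: CAS state updates with a per-collection in-process lock. */
public class LocalStateUpdater {

  private final ZooKeeper zk;
  // One lock per collection so cores on this node queue up locally
  // instead of burning CAS retries against each other.
  private final Map<String, ReentrantLock> collectionLocks = new ConcurrentHashMap<>();

  public LocalStateUpdater(ZooKeeper zk) {
    this.zk = zk;
  }

  public void updateState(String collection, UnaryOperator<byte[]> mutation)
      throws KeeperException, InterruptedException {
    String path = "/collections/" + collection + "/state.json";
    ReentrantLock lock = collectionLocks.computeIfAbsent(collection, c -> new ReentrantLock());
    lock.lock();
    try {
      while (true) {
        Stat stat = new Stat();
        byte[] current = zk.getData(path, false, stat);
        byte[] updated = mutation.apply(current);
        try {
          // Conditional write: succeeds only if nobody changed the znode since we read it.
          zk.setData(path, updated, stat.getVersion());
          return;
        } catch (KeeperException.BadVersionException e) {
          // Another node won the race; re-read and retry.
        }
      }
    } finally {
      lock.unlock();
    }
  }
}

The per-collection lock only keeps cores on the same node from racing each other; cross-node races still resolve through the version check.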
Then at least if a Solr node gets kill -9'd it immediately stops hitting ZK instead of leaving a bunch of garbage in the queue. This would require some changes in ZKStateWriter's assumptions.

On Wed, Nov 23, 2016 at 6:59 PM, Scott Blum <[email protected]> wrote:

> On Wed, Nov 23, 2016 at 5:45 PM, Mark Miller <[email protected]>
> wrote:
>
>> One thing is, when you reconnect after connecting to ZK, it should now
>> efficiently set every core as down in a single command, not each core.
>
> Yeah, I backported downnode, but it still actually takes a long time for
> overseer to execute, and there can be a bunch of these in the queue for the
> same node.
>
> On Wed, Nov 23, 2016 at 5:53 PM, Mark Miller <[email protected]>
> wrote:
>
>> In many cases other nodes need to see a progression of state changes. You
>> really have to clear the deck and try to start from 0.
>
> This is exactly the kind of detail I'm looking for. Can you elaborate?
>
> Unless we can come up with a better idea, my first experiment will be to
> try to eliminate the "DOWN" replica state in all practical cases, relying
> only on careful management of live_nodes presence. For example, the
> startup sequence (or reconnect sequence) would skip marking replicas down
> and just ensure they're ACTIVE or else put them into RECOVERING, join shard
> leader elections, and finally join live_nodes when that's done.
>
> What land mines am I likely to run into or existing assumptions am I
> likely to violate if I do that?
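For concreteness, the reconnect sequence I'm proposing above would look roughly like this. None of these methods exist in SolrCloud today; the interface is a hypothetical placeholder just to make the ordering explicit (no blanket DOWN publish, a per-core ACTIVE-or-RECOVERING decision, then leader election, and live_nodes joined last):

import java.util.List;

/** Illustrative only: a sketch of the proposed startup/reconnect ordering. */
public class ReconnectSequenceSketch {

  /** Hypothetical handle to the node's cluster-state operations. */
  interface NodeStateOps {
    List<String> localCores();
    boolean isInSyncWithLeader(String core);
    void publishActive(String core);
    void publishRecovering(String core);
    void startRecovery(String core);
    void joinLeaderElection(String core);
    void registerLiveNode();
  }

  void onZkReconnect(NodeStateOps node) {
    for (String core : node.localCores()) {
      // No blanket DOWN publish: decide per core.
      if (node.isInSyncWithLeader(core)) {
        node.publishActive(core);      // already caught up, keep serving
      } else {
        node.publishRecovering(core);  // advertise RECOVERING, never DOWN
        node.startRecovery(core);
      }
      node.joinLeaderElection(core);
    }
    // Join live_nodes last, so other nodes never see this node as live
    // while its replica states are still stale.
    node.registerLiveNode();
  }
}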
