Re: [DISCUSS] Gossip shutdown may corrupt peers making it so the cluster never converges, and a small protocol change to fix
Brandon and I have been talking in CASSANDRA-18913 and here is the current plan; sharing for visibility There are two bugs: 1) restart and seeing a shutdown event before gossip has settled for you will create a partial EndpointState which leads to failed startup 2) shutdown corrupts state due to every node modifying their local copy (this may impact host replacements) For #1 this can be fixed without changing networking, so will create a small patch for 4.x line For #2 it does require a network change, so will add this to 5.0 with mixed mode support > On Oct 6, 2023, at 4:18 PM, David Capwell wrote: > >> Won't the replacement have a newer generation? > > The replacement is a different instance. I performs a shadow round with its > seeds and if they are impacted by this issue then they are missing tokens, so > we fail the host replacement… you can work around this by changing the seeds > to nodes that know the token. > >> I don't think it is, this is just fixing a gossip bug, and we should do so >> in all affected versions. > > Right now we sent NoPayload which is 0 bytes, but with the change we send > GossipShutdown which contains the whole EndpointState… I “feel” like 4.x can > not handle this but worth a test (after deserializing the message we have > extra bytes… won’t we get mad about that?)… the gossip handler doesn’t look > at the payload so as long as 4.x serialization can support this, then it > won’t be hard to back port to 4.x > > >> On Oct 6, 2023, at 3:57 PM, Brandon Williams wrote: >> >> On Fri, Oct 6, 2023 at 5:50 PM David Capwell wrote: >>> Lets say you now need to host replace node1 >> >> Won't the replacement have a newer generation? >> >>> avoid peers mutating endpoint states they don’t own >> >> This sounds reasonable to me. >> >>> This would be a protocol change, so would need to make sure everyone is >>> cool with me doing this in 5.0.> >> >> I don't think it is, this is just fixing a gossip bug, and we should >> do so in all affected versions. >> >> Kind Regards, >> Brandon >
Re: [DISCUSS] Gossip shutdown may corrupt peers making it so the cluster never converges, and a small protocol change to fix
> Won't the replacement have a newer generation? The replacement is a different instance. I performs a shadow round with its seeds and if they are impacted by this issue then they are missing tokens, so we fail the host replacement… you can work around this by changing the seeds to nodes that know the token. > I don't think it is, this is just fixing a gossip bug, and we should do so in > all affected versions. Right now we sent NoPayload which is 0 bytes, but with the change we send GossipShutdown which contains the whole EndpointState… I “feel” like 4.x can not handle this but worth a test (after deserializing the message we have extra bytes… won’t we get mad about that?)… the gossip handler doesn’t look at the payload so as long as 4.x serialization can support this, then it won’t be hard to back port to 4.x > On Oct 6, 2023, at 3:57 PM, Brandon Williams wrote: > > On Fri, Oct 6, 2023 at 5:50 PM David Capwell wrote: >> Lets say you now need to host replace node1 > > Won't the replacement have a newer generation? > >> avoid peers mutating endpoint states they don’t own > > This sounds reasonable to me. > >> This would be a protocol change, so would need to make sure everyone is cool >> with me doing this in 5.0.> > > I don't think it is, this is just fixing a gossip bug, and we should > do so in all affected versions. > > Kind Regards, > Brandon
Re: [DISCUSS] Gossip shutdown may corrupt peers making it so the cluster never converges, and a small protocol change to fix
On Fri, Oct 6, 2023 at 5:50 PM David Capwell wrote: > Lets say you now need to host replace node1 Won't the replacement have a newer generation? > avoid peers mutating endpoint states they don’t own This sounds reasonable to me. > This would be a protocol change, so would need to make sure everyone is cool > with me doing this in 5.0.> I don't think it is, this is just fixing a gossip bug, and we should do so in all affected versions. Kind Regards, Brandon