I think it would be prudent to emphasize in the release notes that rolling upgrades (and mixed ensembles generally) are effectively untested, and that this was, in practice, a non-goal of this release cycle. If we can get to rc2 without noticing a showstopper, then clearly nobody has gotten around to attempting one; and there have to be a hundred corner cases beyond the MultiAddress issue.
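To make "untested" concrete, here is a minimal sketch of the restart-and-smoke-test loop that Andor describes further down in this thread (run the old cluster, smoke-test each node, restart one node with new code, repeat). It is only an illustration: `rolling_upgrade_check`, `restart_with_new_version` and `smoke_test` are hypothetical names, and the two callables stand in for whatever a real Jenkins job would do to restart a server and probe it.

```python
from typing import Callable, Iterable, List


def rolling_upgrade_check(
    nodes: Iterable[str],
    restart_with_new_version: Callable[[str], None],  # hypothetical hook
    smoke_test: Callable[[str], bool],                # hypothetical hook
) -> List[str]:
    """Upgrade nodes one at a time, smoke-testing every node after each restart.

    Returns a list of human-readable failures; an empty list means every
    smoke test passed at every step.
    """
    nodes = list(nodes)
    failures: List[str] = []

    def run_smoke_tests(stage: str) -> None:
        # Step 2 of the procedure: connect to each node and run smoke tests.
        for node in nodes:
            if not smoke_test(node):
                failures.append(f"{stage}: smoke test failed on {node}")

    run_smoke_tests("before upgrade")  # baseline on the all-old cluster
    for node in nodes:
        restart_with_new_version(node)  # step 3: restart one node with new code
        run_smoke_tests(f"after upgrading {node}")  # step 4: go back to step 2
    return failures
```

A non-empty result is exactly the kind of signal that would have flagged the 3.5 to 3.6 problem before rc2; continuous traffic and quorum monitoring, as suggested below, would be a further step beyond this sketch.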
On Tue, Feb 11, 2020 at 12:27 PM Szalay-Bekő Máté <[email protected]> wrote:

> I see the main problem here in the fact that we are missing proper versioning in the leader election / quorum protocols. I tried to simply implement backward compatibility in 3.6, but it didn't solve the problem. The new code understands the old protocol, but it cannot decide when to use the new or the old protocol during connection initiation. So the old servers cannot read the new init messages and we still temporarily end up having two partitions during rolling restart.
>
> I already suggested two ways to handle this later, but I think for 3.6.0 the simplest solution now is to disable the new MultiAddress feature and stick to the old protocol version by default. We should also extend the documentation with a note that enabling the MultiAddress feature is not possible during a rolling upgrade and has to be done with a separate rolling restart. With this approach, the rolling restart should "just work" with the 3.4 / 3.5 configs and we don't require any extra step / configuration from the users, unless they want to use the new feature. I plan to submit a PR with these changes tomorrow to ZOOKEEPER-3720, if there isn't any different opinion.
>
> P.S. For 4.0 we might need to put some extra thinking into backward compatibility / versioning for the quorum and client protocols.
>
> On Tue, Feb 11, 2020, 20:44 Michael K. Edwards <[email protected]> wrote:
>
>> I hate to say it, but I think 3.6.0 should release as is. It is impossible to *reliably* retrofit backwards compatibility / interoperability onto a release that was engineered from the beginning without that goal. Learn the lesson, set goals differently in the future.
>>
>> On Tue, Feb 11, 2020 at 9:41 AM Szalay-Bekő Máté <[email protected]> wrote:
>>
>> > FYI: I created these scripts for my local tests:
>> > https://github.com/symat/zk-rolling-upgrade-test
>> >
>> > For the long term I would also add some script that actually monitors the state of the quorum and also runs continuous traffic, not just 1-2 smoketests after each restart. But I don't know how important this would be.
>> >
>> > On Tue, Feb 11, 2020 at 5:25 PM Enrico Olivelli <[email protected]> wrote:
>> >
>> > > On Tue, Feb 11, 2020 at 17:17, Andor Molnar <[email protected]> wrote:
>> > > >
>> > > > The most obvious one that comes to mind is the one I previously worked on:
>> > > >
>> > > > 1) run old version cluster,
>> > > > 2) connect to each node and run smoke tests,
>> > > > 3) restart one node with new code,
>> > > > 4) goto 2) until all nodes are upgraded
>> > > >
>> > > > I think this wouldn't work in a "unit test", we probably need a separate Jenkins job and a nice python script to do this.
>> > > >
>> > > > Andor
>> > > >
>> > > > > On 2020. Feb 11., at 16:38, Patrick Hunt <[email protected]> wrote:
>> > > > >
>> > > > > Anyone have ideas how we could add testing for upgrade? Obviously something we're missing, esp given it's important.
>> > >
>> > > I will send an email in the next few days with a proposal.
>> > > btw my idea is very similar to Andor's
>> > >
>> > > Once we have an automatic environment we can launch it from Jenkins
>> > >
>> > > Enrico
>> > >
>> > > > >
>> > > > > Patrick
>> > > > >
>> > > > > On Tue, Feb 11, 2020 at 12:40 AM Enrico Olivelli <[email protected]> wrote:
>> > > > >
>> > > > >> On Tue, Feb 11, 2020 at 09:12, Szalay-Bekő Máté <[email protected]> wrote:
>> > > > >>>
>> > > > >>> Hi All,
>> > > > >>>
>> > > > >>> about the question from Michael:
>> > > > >>>> Regarding the fix, can we just make 3.6.0 aware of the old protocol and speak old message format when it's talking to old server?
>> > > > >>>
>> > > > >>> In this particular case, it might be enough. The protocol change happened now in the 'initial message' sent by the QuorumCnxManager. Maybe it is not a problem if the new servers cannot initiate channels to the old servers; maybe it is enough if these channels get initiated by the old servers only. I will test it quickly.
>> > > > >>>
>> > > > >>> However, I have no idea if anything else changed in the quorum protocol between 3.5 and 3.6. In other cases it might not be enough if the new servers can understand the old messages, as the old servers can break by not understanding the messages from the new servers. Also, in the code currently (AFAIK) there is no generic knowledge of protocol versions: the servers do not store which protocol versions they can/should use to communicate with which particular other servers. Maybe we don't even need this, but I would feel better if we had more tests around these things.
>> > > > >>>
>> > > > >>> My suggestion for the long term:
>> > > > >>> - let's fix this particular issue now with 3.6.0 quickly (I start doing this today)
>> > > > >>> - let's do some automation (backed up with jenkins) that will test a whole range of combinations of different ZooKeeper upgrade paths by making rolling upgrades during some light traffic. Let's have a bit better definition about what we expect (e.g. the quorum is up, but some clients can get disconnected? What will happen to the ephemeral nodes? Do we want to gracefully close or transfer the user sessions before stopping the old server?) and let's see where this breaks. Just by checking the code, I don't think the quorum will always be up (e.g. between older 3.4 versions and 3.5).
>> > > > >>
>> > > > >> I am happy to work on this topic
>> > > > >>
>> > > > >>> - we need to update the Wiki about the working rolling upgrade paths and maybe about workarounds if needed
>> > > > >>> - we might need to do some fixes (adding backward compatible versions and/or specific parameters that temporarily enforce the old protocol during the rolling upgrade and that can be changed later to the new protocol by either dynamic reconfig or by rolling restart)
>> > > > >>
>> > > > >> it would be much better for the 3.6 code to have some support for compatibility with 3.5 servers
>> > > > >> we can't require old code to be forward compatible but we can make new code be compatible to a certain extent with old code.
>> > > > >> If we can achieve this compatibility goal without a flag, that is better: users won't have to care about this part and can simply "trust" us
>> > > > >>
>> > > > >> The rollback story is also important, but maybe we are still not ready for it. In case of local changes to the store, it is better to have a clear design and plan and to work on it for a new release (3.7?)
>> > > > >>
>> > > > >> Enrico
>> > > > >>
>> > > > >>>
>> > > > >>> Depending on your comments, I am happy to create a few Jira tickets around these topics.
>> > > > >>>
>> > > > >>> Kind regards,
>> > > > >>> Mate
>> > > > >>>
>> > > > >>> ps. Enrico, sorry about your RC... I owe you a beer, let me know if you are near Budapest ;)
>> > > > >>>
>> > > > >>> On Tue, Feb 11, 2020 at 8:43 AM Enrico Olivelli <[email protected]> wrote:
>> > > > >>>
>> > > > >>>> Good.
>> > > > >>>>
>> > > > >>>> I will cancel the vote for 3.6.0rc2.
>> > > > >>>>
>> > > > >>>> I would very much appreciate it if Mate and his colleagues have time to work on a fix.
>> > > > >>>> Otherwise I will have cycles next week
>> > > > >>>>
>> > > > >>>> I would also like to spend my time on setting up a few minimal integration tests about the upgrade story
>> > > > >>>>
>> > > > >>>> Enrico
>> > > > >>>>
>> > > > >>>> On Tue, Feb 11, 2020, 07:30 Michael Han <[email protected]> wrote:
>> > > > >>>>
>> > > > >>>>> Kudos Enrico, very thorough work as the final gate keeper of the release!
>> > > > >>>>>
>> > > > >>>>> Now with this, I'd like to *vote a -1* on the 3.6.0 RC2.
>> > > > >>>>>
>> > > > >>>>> I'd recommend we fix this issue for 3.6.0.
>> > > > >>>>> ZooKeeper is one of the rare pieces of software that put so much emphasis on compatibility that it just works on upgrade / downgrade, which is amazing. One guarantee we always had is that during a rolling upgrade the quorum will always be available, leading to no service interruption. It would be sad to lose such a capability given this is still a tractable problem.
>> > > > >>>>>
>> > > > >>>>> Regarding the fix, can we just make 3.6.0 aware of the old protocol and speak old message format when it's talking to old server? Basically, an ugly if else check against the protocol version should work and there is no need to have multiple passes in the rolling upgrade process.
>> > > > >>>>>
>> > > > >>>>> On Mon, Feb 10, 2020 at 10:23 PM Enrico Olivelli <[email protected]> wrote:
>> > > > >>>>>
>> > > > >>>>>> I suggest this plan:
>> > > > >>>>>> - release 3.6.0 now
>> > > > >>>>>> - improve the migration story, the flow outlined by Mate is interesting, but it will take time
>> > > > >>>>>>
>> > > > >>>>>> 3.6.0rc2 got enough binding votes so I am going to finalize the release this evening (within 8-10 hours) if no one comes out in the VOTE thread with a -1
>> > > > >>>>>>
>> > > > >>>>>> Enrico
>> > > > >>>>>>
>> > > > >>>>>> On Mon, Feb 10, 2020 at 19:33, Patrick Hunt <[email protected]> wrote:
>> > > > >>>>>>>
>> > > > >>>>>>> On Mon, Feb 10, 2020 at 3:38 AM Andor Molnar <[email protected]> wrote:
>> > > > >>>>>>>
>> > > > >>>>>>>> Hi,
>> > > > >>>>>>>>
>> > > > >>>>>>>> Answers inline.
>> > > > >>>>>>>>
>> > > > >>>>>>>>> In my experience when you are close to a release it is better not to make big changes. (I am among the approvers of that patch, so I am responsible for this change)
>> > > > >>>>>>>>
>> > > > >>>>>>>> Although this statement is acceptable to me, I don't think merging this patch into 3.6.0 was a mistake. Submission was preceded by a long argument with MAPR folks who originally wanted it merged into the 3.4 branch (considering the pace at which the ZooKeeper community is moving forward), and we reached an agreement to release it with 3.6.0.
>> > > > >>>>>>>>
>> > > > >>>>>>>> To make a long story short, this patch has been outstanding for ages without much attention from the community, and contributors made a lot of effort to get it done before the release.
>> > > > >>>>>>>>
>> > > > >>>>>>>>> I would like to hear from people that have been in the community for a long time, then I am ready to complete the release process for 3.6.0rc2.
>> > > > >>>>>>>>
>> > > > >>>>>>>> Me too.
>> > > > >>>>>>>>
>> > > > >>>>>>>> I tend to accept the way rolling restart works now - as you described Enrico - and given that the situation was pretty much the same between 3.4 and 3.5, I don't feel we have to make additional changes.
>> > > > >>>>>>>>
>> > > > >>>>>>>> On the other hand, the fix that Mate suggested sounds quite cool, I'm also happy to work on getting it in.
>> > > > >>>>>>>>
>> > > > >>>>>>>> Fyi, the Release Management page says the following:
>> > > > >>>>>>>>
>> > > > >>>>>>>> https://cwiki.apache.org/confluence/display/ZOOKEEPER/ReleaseManagement
>> > > > >>>>>>>>
>> > > > >>>>>>>> "major.minor release of ZooKeeper must be backwards compatible with the previous minor release, major.(minor-1)"
>> > > > >>>>>>>>
>> > > > >>>>>>> Our users, direct and indirect, value the ability to migrate to newer versions - esp as we drop support for older. Frictions such as this can be a reason to go elsewhere. I'm "pro" b/w compat - esp given our published guidelines.
>> > > > >>>>>>>
>> > > > >>>>>>> Patrick
>> > > > >>>>>>>
>> > > > >>>>>>>> Andor
>> > > > >>>>>>>>
>> > > > >>>>>>>>> On 2020. Feb 10., at 11:32, Enrico Olivelli <[email protected]> wrote:
>> > > > >>>>>>>>>
>> > > > >>>>>>>>> Thank you Mate for checking and explaining this story.
>> > > > >>>>>>>>>
>> > > > >>>>>>>>> I find it very interesting that the cause is ZOOKEEPER-3188 as:
>> > > > >>>>>>>>> - it is the last "big patch" committed to 3.6 before starting the release process
>> > > > >>>>>>>>> - it is the cause of the failure of the first RC
>> > > > >>>>>>>>>
>> > > > >>>>>>>>> In my experience when you are close to a release it is better not to make big changes.
>> > > > >>>>>>>>> (I am among the approvers of that patch, so I am responsible for this change)
>> > > > >>>>>>>>>
>> > > > >>>>>>>>> This is a pointer to the change for anyone who wants to understand the context better:
>> > > > >>>>>>>>>
>> > > > >>>>>>>>> https://github.com/apache/zookeeper/pull/1048/files#diff-7a209d890686bcba351d758b64b22a7dR11
>> > > > >>>>>>>>>
>> > > > >>>>>>>>> IIUC even for the upgrade from 3.4 to 3.5 the story was the same, and if this statement holds then I feel we can continue with this release.
>> > > > >>>>>>>>>
>> > > > >>>>>>>>> - Reverting ZOOKEEPER-3188 is not an option for me, it is too complex.
>> > > > >>>>>>>>> - Making 3.5 and 3.6 "compatible" can be very tricky and we do not have tools to certify this compatibility (at least not in the short term)
>> > > > >>>>>>>>>
>> > > > >>>>>>>>> I would like to hear from people that have been in the community for a long time, then I am ready to complete the release process for 3.6.0rc2.
>> > > > >>>>>>>>>
>> > > > >>>>>>>>> I will update the website and the release notes with a specific warning about the upgrade, we should also update the Wiki
>> > > > >>>>>>>>>
>> > > > >>>>>>>>> Enrico
>> > > > >>>>>>>>>
>> > > > >>>>>>>>> On Mon, Feb 10, 2020 at 11:17, Szalay-Bekő Máté <[email protected]> wrote:
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>>> Hi Enrico!
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>>> This is caused by the different PROTOCOL_VERSION in the QuorumCnxManager.
>> > > > >>>>>>>>>> The protocol version was last changed in ZOOKEEPER-2186, released first in 3.4.7 and 3.5.1, to avoid some crashing / fix some bugs. Later I also changed the protocol version when the format of the initial message changed in ZOOKEEPER-3188. So actually the quorum protocol is not compatible in this case, and this is the 'expected' behavior if you upgrade e.g. from 3.4.6 to 3.4.7, or 3.4.6 to 3.5.5, or e.g. from 3.5.6 to 3.6.0.
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>>> We had some discussion in the PR of ZOOKEEPER-3188 back then and got to the conclusion that it is not that bad, as there will be no data loss, as you wrote. The tricky thing is that during rolling upgrade we should ensure both backward and forward compatibility to make sure that the old and the new part of the quorum can still speak to each other. The current solution (simply failing if the protocol versions mismatch) is simpler and still works just fine: as the servers are restarted one-by-one, the nodes with the old protocol version and the nodes with the new protocol version will form two partitions, but at any given time only one partition will have the quorum.
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>>> Still, thinking it through, as a side effect in these cases there will be a short time when none of the partitions will have quorum (when we have N servers with the old protocol version, N servers with the new protocol version, and there is one server just being restarted). I am not sure if we can accept this.
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>>> For ZOOKEEPER-3188 we can add a small patch to make it possible to parse the initial message of the old protocol version with the new code. But I am not sure if it would be enough (as the old code will not be able to parse the new initial message).
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>>> One option can be to make a patch also for 3.5 to have a version which supports both protocol versions (let's say in 3.5.8). Then we can write in the release notes that if you need a rolling upgrade from any version since 3.4.7, then you have to first upgrade to 3.5.8 before upgrading to 3.6.0. We can even do the same thing on the 3.4 branch.
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>>> But I am also new to the community... It would be great to hear the opinion of more experienced people.
>> > > > >>>>>>>>>> Whatever the decision will be, I am happy to make the changes.
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>>> And sorry for breaking the RC (if we decide that this needs to be changed...). ZOOKEEPER-3188 was a complex patch.
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>>> Kind regards,
>> > > > >>>>>>>>>> Mate
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>>> On Mon, Feb 10, 2020 at 9:47 AM Enrico Olivelli <[email protected]> wrote:
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>>>> Hi,
>> > > > >>>>>>>>>>> even though we had enough binding +1s on 3.6.0rc2, before closing the VOTE on 3.6.0 I wanted to finish my tests, and I have run into an apparent blocker.
>> > > > >>>>>>>>>>>
>> > > > >>>>>>>>>>> I am trying to upgrade a 3.5.6 cluster to 3.6.0, but it looks like peers are not able to talk to each other.
>> > > > >>>>>>>>>>> I have a cluster of 3: server1, server2 and server3.
>> > > > >>>>>>>>>>> When I upgrade server1 to 3.6.0rc2 I see this kind of error on the 3.5 nodes:
>> > > > >>>>>>>>>>>
>> > > > >>>>>>>>>>> 2020-02-10 09:35:07,745 [myid:3] - INFO [localhost/127.0.0.1:3334:QuorumCnxManager$Listener@918] - Received connection request 127.0.0.1:62591
>> > > > >>>>>>>>>>> 2020-02-10 09:35:07,746 [myid:3] - ERROR [localhost/127.0.0.1:3334:QuorumCnxManager@527] - org.apache.zookeeper.server.quorum.QuorumCnxManager$InitialMessage$InitialMessageException: Got unrecognized protocol version -65535
>> > > > >>>>>>>>>>>
>> > > > >>>>>>>>>>> Once I upgrade all of the peers the system is up and running, apparently without data loss.
>> > > > >>>>>>>>>>>
>> > > > >>>>>>>>>>> During the upgrade, as soon as I upgrade the first node, say, server1, server1 is not able to accept connections from clients (error "Close of session 0x0 java.io.IOException: ZooKeeperServer not running"). This is expected, because as long as it cannot talk with the other peers it is practically partitioned away from the cluster.
>> > > > >>>>>>>>>>>
>> > > > >>>>>>>>>>> My questions are:
>> > > > >>>>>>>>>>> 1) is this expected ?
>> > > > >>>>>>>>>>> I can't remember protocol changes from 3.5 to 3.6, but actually 3.6 diverged from the 3.5 branch so long ago, and I was not in the community as a dev, so I cannot tell
>> > > > >>>>>>>>>>> 2) is this a viable option for users ? to have some temporary glitch during the upgrade and hope that the upgrade completes without troubles ?
>> > > > >>>>>>>>>>>
>> > > > >>>>>>>>>>> In theory as long as two servers are running the same major version (3.5 or 3.6) we have a quorum and the system is able to make progress and to serve clients.
>> > > > >>>>>>>>>>> I feel that this is quite dangerous, but I don't have enough context to understand how this problem is possible and when we decided to break compatibility.
>> > > > >>>>>>>>>>>
>> > > > >>>>>>>>>>> The other option is that I am wrong in my test and I am messing up :-)
>> > > > >>>>>>>>>>>
>> > > > >>>>>>>>>>> The other upgrade path I would like to see working like a charm is the upgrade from 3.4 to 3.6, as I see that as soon as we release 3.6 we should encourage users to move to 3.6 and not to 3.5.
>> > > > >>>>>>>>>>>
>> > > > >>>>>>>>>>> Regards
>> > > > >>>>>>>>>>> Enrico
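
The quorum argument quoted above (two servers on the same version form a quorum in a cluster of 3, but there is a window with N old servers, N new servers and one restarting) can be checked with a few lines of arithmetic. This is a minimal sketch under one assumption taken from the thread: old and new servers cannot talk to each other at all, so they form two separate partitions. The function names are made up for illustration and are not ZooKeeper APIs.

```python
from typing import Iterator, Optional, Tuple


def has_quorum(members: int, ensemble_size: int) -> bool:
    """A partition has quorum when it holds a strict majority of the ensemble."""
    return members > ensemble_size // 2


def quorum_side(old_up: int, new_up: int, ensemble_size: int) -> Optional[str]:
    """Return which partition ("old" or "new") has quorum, or None if neither."""
    if has_quorum(old_up, ensemble_size):
        return "old"
    if has_quorum(new_up, ensemble_size):
        return "new"
    return None


def rolling_upgrade_states(
    ensemble_size: int,
) -> Iterator[Tuple[int, int, Optional[str]]]:
    """Yield (old_up, new_up, quorum_side) for each phase of a one-by-one upgrade."""
    old_up, new_up = ensemble_size, 0
    yield old_up, new_up, quorum_side(old_up, new_up, ensemble_size)
    for _ in range(ensemble_size):
        old_up -= 1  # one old server is stopped for the upgrade
        yield old_up, new_up, quorum_side(old_up, new_up, ensemble_size)
        new_up += 1  # the same server comes back on the new version
        yield old_up, new_up, quorum_side(old_up, new_up, ensemble_size)
```

Iterating `rolling_upgrade_states(3)` shows a single phase with no quorum on either side: one old server and one new server are up while the third is restarting. The same holds for any odd ensemble of size 2N+1, where the gap appears at N old and N new servers.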
