Eugene, with all due respect to you and the other OpenStack developers, as a system administrator I do not simply take it on faith when someone says that something works a particular way. What I would actually prefer to do is stress-test these services for their 'statelessness'. Currently the l3-agent is not so stateless and lacks proper centralized synchronization, which you have not actually refuted. So I agree - let's move this into a different thread and not hijack this one.
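To make concrete the kind of checks I keep referring to below (a rough hypothetical sketch, not our actual OCF resource agent - the exact CLI filters and the surrounding OCF plumbing are my assumptions), the monitor and cleanup pieces could look roughly like this:

```shell
#!/bin/sh
# Hypothetical sketch of an OCF-style monitor/cleanup for neutron-l3-agent.
# The neutron CLI invocation and option names are illustrative assumptions,
# not the shipped Fuel OCF script.

OCF_SUCCESS=0
OCF_NOT_RUNNING=7

# monitor: healthy only if the process is up AND the neutron server
# still reports this host's L3 agent as alive (':-)' in CLI output).
l3_agent_monitor() {
    pgrep -f neutron-l3-agent >/dev/null || return $OCF_NOT_RUNNING
    neutron agent-list --agent_type='L3 agent' --host="$(hostname -f)" \
        | grep -q ':-)' || return $OCF_NOT_RUNNING
    return $OCF_SUCCESS
}

# Given `ip netns list` output on stdin, print the neutron router
# namespaces; these hold the IPs that must be purged on a non-quorate
# node to avoid the duplicate-IP/ARP trouble described below.
stale_router_namespaces() {
    awk '/^qrouter-/ { print $1 }'
}

# cleanup: delete every router namespace (and its IPs) on this node.
l3_agent_cleanup() {
    ip netns list | stale_router_namespaces | while read -r ns; do
        ip netns delete "$ns"
    done
}
```

The point is not these exact commands, but that the monitor's health signal comes from the neutron server itself, so pacemaker only ever acts on state the server has confirmed.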
On Tue, Oct 6, 2015 at 5:11 PM, Eugene Nikanorov <[email protected]> wrote:
>
> On Tue, Oct 6, 2015 at 4:22 PM, Vladimir Kuklin <[email protected]> wrote:
>
>> Eugene
>>
>> For example, each time that you need to have one instance (e.g. a master instance) of something non-stateless running in the cluster.
>
> Right. This is theoretical. Practically, there are no such services among openstack.
>
>> You are right that currently lots of things are fixed already - heat engine is fine, for example. But I still see this issue with l3 agents, and I will not change my mind until we conduct complete scale and destructive testing with the new neutron code.
>>
>> Secondly, if we cannot reliably identify when to engage, then we need to write the code that will tell us when to engage. If this code is already in place and we can trigger a couple of commands to figure out a Neutron agent's state, then we can add them to the OCF script's monitor action and that is all. I agree that we have some issues with our OCF scripts - for example, some suboptimal cleanup code that has issues at big scale - but I am almost sure we can fix it.
>>
>> Finally, let me show an example of when you need a centralized cluster manager to manage such situations: you have a temporary connectivity issue to the neutron server over the management network for some reason. Your agents are not cleaned up, and the neutron server starts new l3 agent instances on a different node. In this case you will get IP duplication in the network, which will bring down the whole cluster, because connectivity through the 'public' network will keep working just fine. When we are using Pacemaker, such a node will either be fenced or will stop all the services controlled by pacemaker, as it is part of a non-quorate partition of the cluster. When that happens, the l3 agent OCF script will run its cleanup section and purge all the stale IPs, thus saving us from the trouble.
>> I obviously may be mistaken, so please correct me if this is not the case.
>
> I think this deserves discussion in a separate thread, which I'll start soon.
> My initial point was (to state it clearly) that I will be -2 on any new additions of openstack services to the pacemaker kingdom.
>
> Thanks,
> Eugene.
>
>> On Tue, Oct 6, 2015 at 3:46 PM, Eugene Nikanorov <[email protected]> wrote:
>>
>>>> 2) I think you misunderstand the difference between upstart/systemd and Pacemaker in this case. There are many cases when you need a synchronized view of the cluster. Otherwise you will hit split-brain situations and have your cluster malfunctioning. Until OpenStack provides us with such means, there is no other way than using Pacemaker/Zookeeper/etc.
>>>
>>> Could you please give some examples of those 'many cases' for openstack specifically?
>>> As for my 'misunderstanding' - openstack services only need to be always up, not more than that.
>>> Upstart does a perfect job there.
>>>
>>>> 3) Regarding Neutron agents - we discussed it many times - you need to be able to control and clean up stuff after some service crashes. Currently, Neutron does not provide reliable ways to do it. If your agent dies and does not clean up ip addresses from the network namespace, you will get into a situation of ARP duplication, which is a kind of the split brain described in item #2. I personally, as a system architect and administrator, do not believe this will change for OpenStack in at least several years, so we will be using Pacemaker for a very long period of time.
>>>
>>> This has been changed already, and a while ago.
>>> OCF infrastructure around neutron agents has never helped neutron in any meaningful way and is just an artifact from the dark past.
>>> The reasons are: pacemaker/ocf doesn't have enough intelligence to know when to engage; as a result, any cleanup could only be achieved through manual operations. I don't need to remind you how many bugs there were in the ocf scripts, which brought whole clusters down after those manual operations.
>>> So it's just way better to go with simple standard tools with fine-grained control.
>>> The same applies to any other openstack service (again, not rabbitmq/galera).
>>>
>>> > so we will be using Pacemaker for a very long period of time.
>>> Not for neutron, sorry. As soon as we finish the last bit of such cleanup, which is targeted for 8.0.
>>>
>>>> Now, back to the topic - we may decide to use some more sophisticated integral node health attribute, which can be used with Pacemaker as well as to put a node into some kind of maintenance mode. We can leverage the User Maintenance Mode feature here, or just simply stop particular services and disable particular haproxy backends.
>>>
>>> I think this kind of attribute, although being analyzed by pacemaker/ocf, doesn't need any new OS service to be put under pacemaker control.
>>>
>>> Thanks,
>>> Eugene.
>>>
>>>> On Mon, Oct 5, 2015 at 11:57 PM, Eugene Nikanorov <[email protected]> wrote:
>>>>
>>>>>> Mirantis controls neither Rabbitmq nor Galera. Mirantis cannot assure their quality either.
>>>>>
>>>>> Correct, and rabbitmq was always the pain in the back, preventing any *real* enterprise usage of openstack where reliability does matter.
>>>>>
>>>>>> > 2) it has terrible UX
>>>>>>
>>>>>> It looks like a personal opinion. I'd like to see surveys or operator feedback. Also, this statement is not constructive, as it doesn't offer alternative solutions.
>>>>> The solution is to get rid of terrible UX wherever possible (I'm not saying it is always possible, of course); upstart is just so much better.
>>>>> And yes, this is my personal opinion and a summary of the escalation team's experience.
>>>>>
>>>>>> > 3) it is not reliable
>>>>>>
>>>>>> I would say openstack services are not HA-reliable, so OCF scripts are operators' reaction to these problems. Many of the services have childish issues from release to release; operators wrote OCF scripts to fix these problems. A lot of openstack services are stateful, so they require some kind of stickiness or synchronization. Openstack services don't have simple health-check functionality, so it's hard to say whether one is running well or not. Sighup is still a problem for many openstack services. Etc/etc. So let's be constructive here.
>>>>>
>>>>> Well, I prefer to be responsible for what I know and maintain. Thus, I state that neutron doesn't need to be managed by pacemaker - neither the server nor any kind of agent - and that's the path the neutron team will be taking.
>>>>>
>>>>> Thanks,
>>>>> Eugene.
>>>>>
>>>>>>> I disagree with #1, as I do not agree that should be a criterion for an open-source project. Considering pacemaker is at the core of our controller setup, I would argue that if these claims are in fact true we need to be using something else. I would agree that it has a terrible UX, but all the clustering software I've used falls in this category. I'd like more information on how it is not reliable. Do we have numbers to back up these claims?
>>>>>>>
>>>>>>> > (3) is not an evaluation of the project itself, but just a logical consequence of (1) and (2).
>>>>>>> > As a part of the escalation team I can say that it has cost our team thousands of man-hours of head-scratching, staring at pacemaker logs whose value is usually slightly below zero.
>>>>>>> >
>>>>>>> > Most openstack services (in fact, ALL api servers) are stateless; they don't require any cluster management (also, they don't need to be moved in case of lack of space).
>>>>>>> > Stateful services like neutron agents have their state be a function of db state, and are able to synchronize it with the server without external "help".
>>>>>>>
>>>>>>> So it's not an issue with moving services so much as being able to stop the services when a condition is met. Have we tested all OS services to ensure they function 100% when out of disk space? I would assume that glance might have issues with image uploads if there is no space to handle a request.
>>>>>>>
>>>>>>> > So now usage of pacemaker can only be justified for cases where a service's clustering mechanism requires active monitoring (rabbitmq, galera).
>>>>>>> > But even there, examples where we are better off without pacemaker are all around.
>>>>>>> >
>>>>>>> > Thanks,
>>>>>>> > Eugene.
>>>>>>>
>>>>>>> After I sent this email, I had further discussions around the issues I'm facing, and they may not be completely related to disk space. I think we might be relying on the expectation that the local rabbitmq is always available, but I need to look into that. Either way, I believe we should still continue to discuss this issue, as we are managing services in multiple ways on a single host. Additionally, I do not believe that we really perform quality health checks on our services.
>>>>>>> Thanks,
>>>>>>> -Alex
>>>>>>>
>>>>>>> > On Mon, Oct 5, 2015 at 1:34 PM, Sergey Vasilenko <[email protected]> wrote:
>>>>>>> >>
>>>>>>> >> On Mon, Oct 5, 2015 at 12:22 PM, Eugene Nikanorov <[email protected]> wrote:
>>>>>>> >>>
>>>>>>> >>> No pacemaker for os services, please.
>>>>>>> >>> We'll be moving neutron agents out of pacemaker control in 8.0; other os services don't need it either.
>>>>>>> >>
>>>>>>> >> Could you please provide your arguments?
>>>>>>> >>
>>>>>>> >> /sv
--
Yours Faithfully,
Vladimir Kuklin,
Fuel Library Tech Lead,
Mirantis, Inc.
+7 (495) 640-49-04
+7 (926) 702-39-68
Skype kuklinvv
35bk3, Vorontsovskaya Str.
Moscow, Russia,
www.mirantis.com
www.mirantis.ru
[email protected]
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: [email protected]?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
