Re: [DISCUSS] CloudStack graceful shutdown

ilya musayev Sat, 21 Apr 2018 12:05:09 -0700

Rafael

What you are suggesting - was already implemented. We've created Load
Balancing algorithms - but we did not take into account the LB algo for
maintenance (yet). Rohit and ShapeBlue were the developers behind the
feature.


What needs to happen is a tweak to LB Algorithms to become MS maintenance
aware - or create new LB Algos altogether. Essentially we need to merge
your work and this feature. Please read the FS below.

Functional Spec:


The new CA framework introduced basic support for comma-separated
list of management servers for agent, which makes an external LB
unnecessary.

This extends that feature to implement LB sorting algorithms that
sort the management server list before it is sent to the agents.
This adds central intelligence in the management server and
enhances the Agent class to be algorithm aware and to run a
background mechanism that checks for and falls back to the preferred
management server (assumed to be the first in the list). This is
supported for any indirect agent, such as the KVM, CPVM and SSVM
agents, and would also support management server host migration
during upgrade (when, instead of upgrading in place, new hosts are
used to set up new mgmt servers).

This FR introduces two new global settings:

   - indirect.agent.lb.algorithm: The algorithm for the indirect agent LB.
   - indirect.agent.lb.check.interval: The preferred host check interval
   for the agent's background task that checks and switches to agent's
   preferred host.

The indirect.agent.lb.algorithm setting supports the following options:

   - static: use the list as provided.
   - roundrobin: evenly spreads hosts across management servers based on
   host's id.
   - shuffle: (pseudo) randomly sorts the list (not recommended for
   production).
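
As a rough illustration of the three options above, here is a minimal
Python sketch. It is not the CloudStack implementation: the function name
sort_ms_list, the seed parameter, and the choice to realise 'roundrobin'
by rotating the list by (host id modulo list length) are all assumptions
made for the example.

```python
import random


def sort_ms_list(ms_list, algorithm, host_id, seed=None):
    """Return the management server list in the order one agent would get it.

    Hypothetical helper: names and the rotation rule are assumptions,
    not taken from the CloudStack code base.
    """
    servers = list(ms_list)
    if algorithm == "static":
        # Use the list exactly as the admin provided it.
        return servers
    if algorithm == "roundrobin":
        if not servers:
            return servers
        # Assumed rule: rotate by host id so preferred hosts are spread evenly.
        offset = host_id % len(servers)
        return servers[offset:] + servers[:offset]
    if algorithm == "shuffle":
        # Pseudo-random order; seed is only here to make the sketch testable.
        random.Random(seed).shuffle(servers)
        return servers
    raise ValueError(f"unknown algorithm: {algorithm}")
```

With three management servers and this rotation rule, hosts with ids
0..5 would get preferred (first) servers ms1, ms2, ms3, ms1, ms2, ms3.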

Changes to the global settings indirect.agent.lb.algorithm and host
do not require restarting the management server(s) or the agents.
A message bus based system dynamically reacts to changes in these
global settings and propagates them to all connected agents.

The comma-separated management server list is propagated to agents in
the following cases:

   - Addition of a host (including ssvm, cpvm systemvms).
   - Connection or reconnection by the agents to a management server.
   - After admin changes the 'host' and/or the
   'indirect.agent.lb.algorithm' global settings.

On the agent side, the 'host' setting is saved in its properties file as:
host=<comma separated addresses>@<algorithm name>.
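
To make the property format concrete, here is a small Python sketch that
splits and rebuilds such a value. The helper names parse_host_setting and
format_host_setting are made up for illustration, and returning None when
no '@<algorithm name>' suffix is present is an assumption of the sketch,
not documented agent behaviour.

```python
def parse_host_setting(value):
    """Split 'addr1,addr2@algorithm' into (addresses, algorithm).

    Hypothetical helper; algorithm is None when no '@' suffix is present.
    """
    addresses_part, sep, algorithm = value.partition("@")
    addresses = [a.strip() for a in addresses_part.split(",") if a.strip()]
    return addresses, (algorithm.strip() if sep else None)


def format_host_setting(addresses, algorithm):
    """Build the value stored after 'host=' in the properties file."""
    return ",".join(addresses) + "@" + algorithm
```

For example, parse_host_setting("10.0.0.1,10.0.0.2@roundrobin") yields
the two addresses and the algorithm name 'roundrobin' (the addresses are
placeholders).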

First, the agent connects to the management server and sends its current
management server list. The management server compares it against the
list it expects, and if the comparison fails, a new/updated list is sent
for the agent to persist.

From the agent's perspective, the first address in the propagated list
is considered the preferred host. A new background task can be
activated by configuring indirect.agent.lb.check.interval, which is
a cluster-level global setting in CloudStack; admins can also
override it by configuring 'host.lb.check.interval' in the
agent.properties file.

Every time the agent gets an ms-host list and the algorithm, the
host-specific background check interval is also sent, and it dynamically
reconfigures the background task without the need to restart agents.
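
The decision the background task has to make on each run can be sketched
as follows. This is only an illustration under assumptions: the function
choose_target and the is_reachable callback are hypothetical, and the
real agent may apply further conditions before switching.

```python
def choose_target(ms_list, connected_to, is_reachable):
    """Return the address the agent should switch to, or None to stay put.

    Hypothetical sketch: the preferred host is the first list entry, and
    is_reachable is a caller-supplied probe (for example a TCP connect).
    """
    if not ms_list:
        return None
    preferred = ms_list[0]
    if connected_to == preferred:
        # Already on the preferred host, nothing to do.
        return None
    if is_reachable(preferred):
        # Fall back to the preferred host now that it answers again.
        return preferred
    # Preferred host still down: keep the current connection.
    return None
```

In this sketch the check interval would simply be how often the agent
calls choose_target; a new interval sent by the management server would
reschedule that call without an agent restart.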

Note: The 'static' and 'roundrobin' algorithms strictly check for the
order they expect; the 'shuffle' algorithm, however, checks only the
content and not the order of the comma-separated ms host addresses.
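
The difference between the two kinds of check can be shown in a few lines
of Python. The function name list_is_current is an assumption for the
example, and comparing sorted copies is just one way to express a
content-only check.

```python
def list_is_current(agent_list, expected_list, algorithm):
    """Decide whether the list sent by the agent still matches.

    Hypothetical sketch: 'shuffle' compares content only, every other
    algorithm compares content and order.
    """
    if algorithm == "shuffle":
        # Order is random by design, so only the set of addresses matters.
        return sorted(agent_list) == sorted(expected_list)
    # 'static' and 'roundrobin' expect the exact order.
    return list(agent_list) == list(expected_list)
```

If the check returns False, the management server would send the
new/updated list for the agent to persist.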

Regards
ilya


On Fri, Apr 20, 2018 at 1:01 PM, Rafael Weingärtner <
[email protected]> wrote:

> Is that management server load balancing feature using static
> configurations? I heard about it on the mailing list, but I did not follow
> the implementation.
>
> I do not see many problems with agents reconnecting. We can implement in
> agents (not just KVM, but also system VMs) a logic that instead of using a
> static pool of management servers configured in a properties file, they
> dynamically request a list of available management servers via that list
> management servers API method. This would require us to configure agents
> with a load balancer URL that executes the balancing between multiple
> management servers.
>
> I am +1 to remove the need for that VIP, which executes the load balance
> for connecting agents to management servers.
>
> On Fri, Apr 20, 2018 at 4:41 PM, ilya musayev <
> [email protected]>
> wrote:
>
> > Rafael and Community
> >
> > All is well and good and i think we are thinking along the similar lines
> -
> > the only issue that i see right now with any approach is KVM Agents (or
> > direct agents) and using LoadBalancer on 8250.
> >
> > Here is a scenario:
> >
> > You have 2 Management Server setup fronted with a VIP on 8250.
> > The LB Algorithm is either Round Robin or Least Connections used.
> > You initiate a maintenance mode operation on one of the MS servers (call
> it
> > MS1) - assume you have a long running migration job that needs 60 minutes
> > to complete.
> > We attempt to evacuate the agents by telling them to disconnect and
> > reconnect again
> > If we are using LB on 8250 with
> > 1) Least Connection used - then all agents will continuously try to
> connect
> > to a MS1 node that is attempting to go down for maintenance. Essentially
> > with this  LB configuration this operation will never
> > 2) Round Robin - this will take a while - but eventually - you will get
> all
> > nodes connected to MS2
> >
> > The current limitation is usage of external LB on 8250. For this
> operation
> > to work without issue - would mean agents must connect to MS server
> without
> > an LB. This is a recent feature we've developed with ShapeBlue - where we
> > maintain the list of CloudStack Management Servers in the
> agent.properties
> > file.
> >
> > Unless you can think of another solution - it appears we may be forced
> > to bypass the 8250 VIP LB and use the new feature to maintain the list of
> > management servers within agent.properties.
> >
> >
> > I need to run now, let me know what your thoughts are.
> >
> > Regards
> > ilya
> >
> >
> >
> > On Tue, Apr 17, 2018 at 8:27 AM, Rafael Weingärtner <
> > [email protected]> wrote:
> >
> > > Ilya and others,
> > >
> > > We have been discussing this idea of graceful/nicely shutdown.  Our
> > feeling
> > > is that we (in CloudStack community) might have been trying to solve
> this
> > > problem with too much scripting. What if we developed a more integrated
> > > (native) solution?
> > >
> > > Let me explain our idea.
> > >
> > > ACS has a table called “mshost”, which is used to store management
> server
> > > information. During balancing and when jobs are dispatched to other
> > > management servers this table is consulted/queried.  Therefore, we have
> > > been discussing the idea of creating a management API for management
> > > servers.  We could have an API method that changes the state of
> > management
> > > servers to “prepare to maintenance” and then “maintenance” (as soon as
> > all
> > > of the task/jobs it is managing finish). The idea is that during
> > > rebalancing we would remove the hosts of servers that are not in “Up”
> > state
> > > (of course we would also ignore hosts in the aforementioned state to
> > > receive hosts to manage).  Moreover, when we send/dispatch jobs to
> other
> > > management servers, we could ignore the ones that are not in “Up” state
> > > (which is something already done).
> > >
> > > By doing this, the nicely shutdown could be executed in a few steps.
> > >
> > > 1 – issue the maintenance method for the management server you desire
> > > 2 – wait until the MS goes into maintenance mode, while there are still
> > > running jobs it (the management server) will be maintained in prepare
> for
> > > maintenance
> > > 3 – execute the Linux shutdown command
> > >
> > > We would need other APIs methods to manage MSs then. An (i) API method
> to
> > > list MSs, and we could even create an (ii) API to remove
> old/de-activated
> > > management servers, which we currently do not have (forcing users to
> > apply
> > > changes directly in the database).
> > >
> > > Moreover, in this model, we would not kill hanging jobs; we would wait
> > > until they expire and ACS expunges them. Of course, it is possible to
> > > develop a forceful maintenance method as well. Then, when the “prepare
> > for
> > > maintenance” takes longer than a parameter, we could kill hanging jobs.
> > >
> > > All of this would allow the MS to be kept up and receiving requests
> until
> > > it can be safely shut down. What do you guys think about this approach?
> > >
> > > On Tue, Apr 10, 2018 at 6:52 PM, Yiping Zhang <[email protected]>
> > wrote:
> > >
> > > > As a cloud admin, I would love to have this feature.
> > > >
> > > > It so happens that I just accidentally restarted my ACS management
> > server
> > > > while two instances are migrating to another Xen cluster (via storage
> > > > migration, not live migration).  As results, both instances
> > > > ends up with corrupted data disk which can't be reattached or
> migrated.
> > > >
> > > > Any feature which prevents this from happening would be great.  A low
> > > > hanging fruit is simply checking for
> > > > if there are any async jobs running, especially any kind of migration
> > > jobs
> > > > or other known long running type of
> > > > jobs and warn the operator  so that he has a chance to abort server
> > > > shutdowns.
> > > >
> > > > Yiping
> > > >
> > > > On 4/5/18, 3:13 PM, "ilya musayev" <[email protected]>
> > > wrote:
> > > >
> > > >     Andrija
> > > >
> > > >     This is a tough scenario.
> > > >
> > > >     As an admin, the way i would have handled this situation, is to
> > > > advertise
> > > >     the upcoming outage and then take away specific API commands
> from a
> > > > user a
> > > >     day before - so he does not cause any long running async jobs.
> Once
> > > >     maintenance completes - enable the API commands back to the user.
> > > > However -
> > > >     i dont know who your user base is and if this would be an
> > acceptable
> > > >     solution.
> > > >
> > > >     Perhaps also investigate what can be done to speed up your long
> > > running
> > > >     tasks...
> > > >
> > > >     As a side note, we will be working on a feature that would allow
> > for
> > > a
> > > >     graceful termination of the process/job, meaning if agent
> noticed a
> > > >     disconnect or termination request - it will abort the command in
> > > > flight. We
> > > >     can also consider restarting this tasks again or what not - but
> it
> > > > would
> > > >     not be part of this enhancement.
> > > >
> > > >     Regards
> > > >     ilya
> > > >
> > > >     On Thu, Apr 5, 2018 at 6:47 AM, Andrija Panic <
> > > [email protected]
> > > > >
> > > >     wrote:
> > > >
> > > >     > Hi Ilya,
> > > >     >
> > > >     > thanks for the feedback - but in "real world", you need to
> > > > "understand"
> > > >     > that 60min is next to useless timeout for some jobs (if I
> > > understand
> > > > this
> > > >     > specific parameter correctly ?? - job is really canceled, not
> > only
> > > > job
> > > >     > monitoring is canceled ???) -
> > > >     >
> > > >     > My value for the  "job.cancel.threshold.minutes" is 2880
> minutes
> > (2
> > > > days?)
> > > >     >
> > > >     > I can tell you when you have CEPH/NFS (CEPH even "worse" case,
> > > since
> > > > slower
> > > >     > read during qemu-img convert process...) of 500GB, then imagine
> > > > snapshot
> > > >     > job will take many hours. Should I mention 1TB volumes (yes, we
> > had
> > > >     > client's like that...)
> > > >     > Then attaching a 1TB volume, that was uploaded to ACS (lives
> > > > originally on
> > > >     > Secondary Storage, and takes time to be copied over to
> NFS/CEPH)
> > > > will take
> > > >     > up to few hours.
> > > >     > Then migrating 1TB volume from NFS to CEPH, or CEPH to NFS,
> also
> > > > takes
> > > >     > time...etc.
> > > >     >
> > > >     > I'm just giving you feedback as "user", admin of the cloud,
> zero
> > > DEV
> > > > skills
> > > >     > here :) , just to make sure you make practical decisions (and I
> > > > admit I
> > > >     > might be wrong with my stuff, but just giving you feedback from
> > our
> > > > public
> > > >     > cloud setup)
> > > >     >
> > > >     >
> > > >     > Cheers!
> > > >     >
> > > >     >
> > > >     >
> > > >     >
> > > >     > On 5 April 2018 at 15:16, Tutkowski, Mike <
> > > [email protected]
> > > > >
> > > >     > wrote:
> > > >     >
> > > >     > > Wow, there’s been a lot of good details noted from several
> > people
> > > > on how
> > > >     > > this process works today and how we’d like it to work in the
> > near
> > > > future.
> > > >     > >
> > > >     > > 1) Any chance this is already documented on the Wiki?
> > > >     > >
> > > >     > > 2) If not, any chance someone would be willing to do so (a
> flow
> > > > diagram
> > > >     > > would be particularly useful).
> > > >     > >
> > > >     > > > On Apr 5, 2018, at 3:37 AM, Marc-Aurèle Brothier <
> > > > [email protected]>
> > > >     > > wrote:
> > > >     > > >
> > > >     > > > Hi all,
> > > >     > > >
> > > >     > > > Good point ilya but as stated by Sergey there's more thing
> to
> > > > consider
> > > >     > > > before being able to do a proper shutdown. I augmented my
> > > script
> > > > I gave
> > > >     > > you
> > > >     > > > originally and changed code in CS. What we're doing for our
> > > > environment
> > > >     > > is
> > > >     > > > as follow:
> > > >     > > >
> > > >     > > > 1. the MGMT looks for a change in the file /etc/lb-agent
> > which
> > > > contains
> > > >     > > > keywords for HAproxy[2] (ready, maint) so that HA-proxy can
> > > > disable the
> > > >     > > > mgmt on the keyword "maint" and the mgmt server stops a
> > couple
> > > of
> > > >     > > > threads[1] to stop processing async jobs in the queue
> > > >     > > > 2. Looks for the async jobs and wait until there is none to
> > > > ensure you
> > > >     > > can
> > > >     > > > send the reconnect commands (if jobs are running, a
> reconnect
> > > > will
> > > >     > result
> > > >     > > > in a failed job since the result will never reach the
> > > management
> > > >     > server -
> > > >     > > > the agent waits for the current job to be done before
> > > > reconnecting, and
> > > >     > > > discard the result... rooms for improvement here!)
> > > >     > > > 3. Issue a reconnectHost command to all the hosts connected
> > to
> > > > the mgmt
> > > >     > > > server so that they reconnect to another one, otherwise the
> > > mgmt
> > > > must
> > > >     > be
> > > >     > > up
> > > >     > > > since it is used to forward commands to agents.
> > > >     > > > 4. when all agents are reconnected, we can shutdown the
> > > > management
> > > >     > server
> > > >     > > > and perform the maintenance.
> > > >     > > >
> > > >     > > > One issue remains for me, during the reconnect, the
> commands
> > > > that are
> > > >     > > > processed at the same time should be kept in a queue until
> > the
> > > > agents
> > > >     > > have
> > > >     > > > finished any current jobs and have reconnected. Today the
> > > little
> > > > time
> > > >     > > > window during which the reconnect happens can lead to
> failed
> > > > jobs due
> > > >     > to
> > > >     > > > the agent not being connected at the right moment.
> > > >     > > >
> > > >     > > > I could push a PR for the change to stop some processing
> > > threads
> > > > based
> > > >     > on
> > > >     > > > the content of a file. It's possible also to cancel the
> drain
> > > of
> > > > the
> > > >     > > > management by simply changing the content of the file back
> to
> > > > "ready"
> > > >     > > > again, instead of "maint" [2].
> > > >     > > >
> > > >     > > > [1] AsyncJobMgr-Heartbeat, CapacityChecker, StatsCollector
> > > >     > > > [2] HA proxy documentation on agent checker:
> > > > https://cbonte.github.io/
> > > >     > > > haproxy-dconv/1.6/configuration.html#5.2-agent-check
> > > >     > > >
> > > >     > > > Regarding your issue on the port blocking, I think it's
> fair
> > to
> > > >     > consider
> > > >     > > > that if you want to shutdown your server at some point, you
> > > have
> > > > to
> > > >     > stop
> > > >     > > > serving (some) requests. Here the only way it's to stop
> > serving
> > > >     > > everything.
> > > >     > > > If the API had a REST design, we could reject any
> > > POST/PUT/DELETE
> > > >     > > > operations and allow GET ones. I don't know how hard it
> would
> > > be
> > > > today
> > > >     > to
> > > >     > > > only allow listBaseCmd operations to be more friendly with
> > the
> > > > users.
> > > >     > > >
> > > >     > > > Marco
> > > >     > > >
> > > >     > > >
> > > >     > > > On Thu, Apr 5, 2018 at 2:22 AM, Sergey Levitskiy <
> > > > [email protected]>
> > > >     > > > wrote:
> > > >     > > >
> > > >     > > >> Now without spellchecking :)
> > > >     > > >>
> > > >     > > >> This is not simple e.g. for VMware. Each management server
> > > also
> > > > acts
> > > >     > as
> > > >     > > an
> > > >     > > >> agent proxy so tasks against a particular ESX host will be
> > > > always
> > > >     > > >> forwarded. That right answer will be to support a native
> > > > “maintenance
> > > >     > > mode”
> > > >     > > >> for management server. When entered to such mode the
> > > management
> > > > server
> > > >     > > >> should release all agents including SSVM, block/redirect
> API
> > > > calls and
> > > >     > > >> login request and finish all async job it originated.
> > > >     > > >>
> > > >     > > >>
> > > >     > > >>
> > > >     > > >> On Apr 4, 2018, at 5:15 PM, Sergey Levitskiy <
> > > > [email protected]
> > > >     > > <mailto:
> > > >     > > >> [email protected]>> wrote:
> > > >     > > >>
> > > >     > > >> This is not simple e.g. for VMware. Each management server
> > > also
> > > > acts
> > > >     > as
> > > >     > > an
> > > >     > > >> agent proxy so tasks against a particular ESX host will be
> > > > always
> > > >     > > >> forwarded. That right answer will be to a native support
> for
> > > >     > > “maintenance
> > > >     > > >> mode” for management server. When entered to such mode the
> > > > management
> > > >     > > >> server should release all agents including save,
> > > block/redirect
> > > > API
> > > >     > > calls
> > > >     > > >> and login request and finish all a sync job it originated.
> > > >     > > >>
> > > >     > > >> Sent from my iPhone
> > > >     > > >>
> > > >     > > >> On Apr 4, 2018, at 3:31 PM, Rafael Weingärtner <
> > > >     > > >> [email protected]<mailto:rafaelweingartner@
> > > gmail.com
> > > > >>
> > > >     > wrote:
> > > >     > > >>
> > > >     > > >> Ilya, still regarding the management server that is being
> > shut
> > > > down
> > > >     > > issue;
> > > >     > > >> if other MSs/or maybe system VMs (I am not sure to know if
> > > they
> > > > are
> > > >     > > able to
> > > >     > > >> do such tasks) can direct/redirect/send new jobs to this
> > > > management
> > > >     > > server
> > > >     > > >> (the one being shut down), the process might never end
> > because
> > > > new
> > > >     > tasks
> > > >     > > >> are always being created for the management server that we
> > > want
> > > > to
> > > >     > shut
> > > >     > > >> down. Is this scenario possible?
> > > >     > > >>
> > > >     > > >> That is why I mentioned blocking the port 8250 for the
> > > >     > > “graceful-shutdown”.
> > > >     > > >>
> > > >     > > >> If this scenario is not possible, then everything is fine.
> > > >     > > >>
> > > >     > > >>
> > > >     > > >> On Wed, Apr 4, 2018 at 7:14 PM, ilya musayev <
> > > >     > > [email protected]
> > > >     > > >> <mailto:[email protected]>>
> > > >     > > >> wrote:
> > > >     > > >>
> > > >     > > >> I'm thinking of using a configuration from
> > > >     > > "job.cancel.threshold.minutes" -
> > > >     > > >> it will be the longest
> > > >     > > >>
> > > >     > > >>    "category": "Advanced",
> > > >     > > >>
> > > >     > > >>    "description": "Time (in minutes) for async-jobs to be
> > > > forcely
> > > >     > > >> cancelled if it has been in process for long",
> > > >     > > >>
> > > >     > > >>    "name": "job.cancel.threshold.minutes",
> > > >     > > >>
> > > >     > > >>    "value": "60"
> > > >     > > >>
> > > >     > > >>
> > > >     > > >>
> > > >     > > >>
> > > >     > > >> On Wed, Apr 4, 2018 at 1:36 PM, Rafael Weingärtner <
> > > >     > > >> [email protected]<mailto:rafaelweingartner@
> > > gmail.com
> > > > >>
> > > >     > wrote:
> > > >     > > >>
> > > >     > > >> Big +1 for this feature; I only have a few doubts.
> > > >     > > >>
> > > >     > > >> * Regarding the tasks/jobs that management servers (MSs)
> > > > execute; are
> > > >     > > >> these
> > > >     > > >> tasks originate from requests that come to the MS, or is
> it
> > > > possible
> > > >     > > that
> > > >     > > >> requests received by one management server to be executed
> by
> > > > other? I
> > > >     > > >> mean,
> > > >     > > >> if I execute a request against MS1, will this request
> always
> > > be
> > > >     > > >> executed/treated by MS1, or is it possible that this
> > request
> > > is
> > > >     > > executed
> > > >     > > >> by another MS (e.g. MS2)?
> > > >     > > >>
> > > >     > > >> * I would suggest that after we block traffic coming from
> > > >     > > >> 8080/8443/8250(we
> > > >     > > >> will need to block this as well right?), we can log the
> > > > execution of
> > > >     > > >> tasks.
> > > >     > > >> I mean, something saying, there are XXX tasks (enumerate
> > > tasks)
> > > > still
> > > >     > > >> being
> > > >     > > >> executed, we will wait for them to finish before shutting
> > > down.
> > > >     > > >>
> > > >     > > >> * The timeout (60 minutes suggested) could be global
> > settings
> > > > that we
> > > >     > > can
> > > >     > > >> load before executing the graceful-shutdown.
> > > >     > > >>
> > > >     > > >> On Wed, Apr 4, 2018 at 5:15 PM, ilya musayev <
> > > >     > > >> [email protected]<mailto:ilya.mailing.lists@
> > > > gmail.com>
> > > >     > > >>
> > > >     > > >> wrote:
> > > >     > > >>
> > > >     > > >> Use case:
> > > >     > > >> In any environment - time to time - administrator needs to
> > > > perform a
> > > >     > > >> maintenance. Current stop sequence of cloudstack
> management
> > > > server
> > > >     > will
> > > >     > > >> ignore the fact that there may be long running async jobs
> -
> > > and
> > > >     > > >> terminate
> > > >     > > >> the process. This in turn can create a poor user
> experience
> > > and
> > > >     > > >> occasional
> > > >     > > >> inconsistency  in cloudstack db.
> > > >     > > >>
> > > >     > > >> This is especially painful in large environments where the
> > > user
> > > > has
> > > >     > > >> thousands of nodes and there is a continuous patching that
> > > > happens
> > > >     > > >> around
> > > >     > > >> the clock - that requires migration of workload from one
> > node
> > > to
> > > >     > > >> another.
> > > >     > > >>
> > > >     > > >> With that said - i've created a script that monitors the
> > async
> > > > job
> > > >     > > >> queue
> > > >     > > >> for given MS and waits for it complete all jobs. More
> > details
> > > > are
> > > >     > > >> posted
> > > >     > > >> below.
> > > >     > > >>
> > > >     > > >> I'd like to introduce "graceful-shutdown" into the
> > > > systemctl/service
> > > >     > of
> > > >     > > >> cloudstack-management service.
> > > >     > > >>
> > > >     > > >> The details of how it will work is below:
> > > >     > > >>
> > > >     > > >> Workflow for graceful shutdown:
> > > >     > > >> Using iptables/firewalld - block any connection attempts
> on
> > > > 8080/8443
> > > >     > > >> (we
> > > >     > > >> can identify the ports dynamically)
> > > >     > > >> Identify the MSID for the node, using the proper msid -
> > query
> > > >     > > >> async_job
> > > >     > > >> table for
> > > >     > > >> 1) any jobs that are still running (or job_status=“0”)
> > > >     > > >> 2) job_dispatcher not like “pseudoJobDispatcher"
> > > >     > > >> 3) job_init_msid=$my_ms_id
> > > >     > > >>
> > > >     > > >> Monitor this async_job table for 60 minutes - until all
> > async
> > > > jobs for
> > > >     > > >> MSID
> > > >     > > >> are done, then proceed with shutdown
> > > >     > > >>  If failed for any reason or terminated, catch the exit
> via
> > > trap
> > > >     > > >> command
> > > >     > > >> and unblock the 8080/8443
> > > >     > > >>
> > > >     > > >> Comments are welcome
> > > >     > > >>
> > > >     > > >> Regards,
> > > >     > > >> ilya
> > > >     > > >>
> > > >     > > >>
> > > >     > > >>
> > > >     > > >>
> > > >     > > >> --
> > > >     > > >> Rafael Weingärtner
> > > >     > > >>
> > > >     > > >>
> > > >     > > >>
> > > >     > > >>
> > > >     > > >>
> > > >     > > >> --
> > > >     > > >> Rafael Weingärtner
> > > >     > > >>
> > > >     > >
> > > >     >
> > > >     >
> > > >     >
> > > >     > --
> > > >     >
> > > >     > Andrija Panić
> > > >     >
> > > >
> > > >
> > > >
> > >
> > >
> > > --
> > > Rafael Weingärtner
> > >
> >
>
>
>
> --
> Rafael Weingärtner
>
