Re: [PATCH stable-2.17] Prevent watcher from holding lock on verify group
On Fri, Jun 17, 2016 at 04:31:55PM +0100, 'Federico Morg Pareschi' via ganeti-devel wrote:
> The ganeti-watcher holds the group file lock for too long, until after
> the execution of a group-verify-disk job. The lock is held for a long
> time if there are other jobs already running and blocking the verify
> from executing. When the lock is held, another ganeti-watcher run
> cannot be scheduled, so this prevents the ganeti-watcher from running
> for several minutes.
>
> With this commit, the lock is released before running the VerifyDisks
> operation, so even if the submitted job gets stuck in the Job Queue, a
> subsequent ganeti-watcher run would still happen.
>
> As an additional change, an incorrect docstring was also removed.
>
> Signed-off-by: Federico Morg Pareschi

OK, thanks for the patch and the discussion, and sorry for the delay in
reviewing this. As you say, we'll have to figure out a good way to avoid
multiple redundant VerifyDisks jobs running at once, but we can worry
about that later, and for now this LGTM.

Brian.
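For the archives, one possible shape for that future fix is a submit-time
guard: before queueing a new verify, scan the job queue for an equivalent
job that is still pending. The sketch below is an illustration only, not
ganeti code: QueryJobs/SubmitJob follow the general LUXI client interface,
but the queried fields ("status", "ops"), the status constants, and the
opcode matching are assumptions.

from ganeti import constants, opcodes

# Job states in which an earlier verify still covers ours (assumed names).
_PENDING = frozenset([constants.JOB_STATUS_QUEUED,
                      constants.JOB_STATUS_WAITING,
                      constants.JOB_STATUS_RUNNING])

def _SubmitVerifyUnlessPending(client, group_uuid):
  """Submit a verify-disks job for a group unless one is already pending."""
  for status, ops in client.QueryJobs(None, ["status", "ops"]):
    if status not in _PENDING:
      continue
    for op in ops:
      # Serialized opcodes are assumed to carry their OP_ID and parameters.
      if (op.get("OP_ID") == "OP_CLUSTER_VERIFY_DISKS" and
          op.get("group_name") == group_uuid):
        return None  # an equivalent verify job is already on its way
  return client.SubmitJob([opcodes.OpClusterVerifyDisks(group_name=group_uuid)])

Note this still leaves a small check-then-submit race between two watchers
running concurrently, so a complete fix would probably have to deduplicate
inside the job queue itself or under a short-lived lock.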
Re: [PATCH stable-2.17] Prevent watcher from holding lock on verify group
On 17 June 2016 at 10:36, Federico Pareschi wrote:
> Yes, this is definitely the case.
> We have ideas to implement something in the future to obviate this
> problem, although we're still considering exactly how to implement it.
> This is a short-term quick fix to solve some blocker issues that show
> up as a consequence of it.

Ack, thanks for the info!

iustin

> On 17 June 2016 at 17:57, Iustin Pop wrote:
>
>> 2016-06-17 9:46 GMT-07:00 Federico Pareschi:
>>
>>> When a ganeti-watcher runs on the nodegroup, it submits the
>>> verify-group job. If there is another job in the queue that is taking
>>> some locks that stop the verify-group job (like an instance
>>> creation), then the whole ganeti-watcher is blocked and has to wait
>>> for that job to finish.
>>>
>>> We have a case where we need the ganeti-watcher's periodic check to
>>> restart some downed instances, but the ganeti-watcher itself gets
>>> stuck on some other jobs and those downed instances don't get brought
>>> back up on time (and this causes problems for us).
>>>
>>> There really is no reason to hold a lock on the watcher state file
>>> when submitting the verify disk job anyway.
>>
>> All this makes sense. My question is rather: if we end up with
>> multiple watchers basically running (waiting) concurrently, can we
>> end up with multiple (redundant) verify-group jobs?
>>
>> Sorry if I misunderstand the situation.
>>
>> iustin
>>
>>> On 17 June 2016 at 17:18, Iustin Pop wrote:
>>>
>>>> 2016-06-17 8:31 GMT-07:00 'Federico Morg Pareschi' via ganeti-devel
>>>> <ganeti-devel@googlegroups.com>:
>>>>
>>>>> The ganeti-watcher holds the group file lock for too long, until
>>>>> after the execution of a group-verify-disk job. The lock is held
>>>>> for a long time if there are other jobs already running and
>>>>> blocking the verify from executing. When the lock is held, another
>>>>> ganeti-watcher run cannot be scheduled, so this prevents the
>>>>> ganeti-watcher from running for several minutes.
>>>>>
>>>>> With this commit, the lock is released before running the
>>>>> VerifyDisks operation, so even if the submitted job gets stuck in
>>>>> the Job Queue, a subsequent ganeti-watcher run would still happen.
>>>>
>>>> Quick question: what prevents a runaway case where VerifyDisks is
>>>> blocking for hours and we have many watchers all running and
>>>> submitting their own VerifyDisks?
>>>>
>>>> thanks,
>>>> iustin
Re: [PATCH stable-2.17] Prevent watcher from holding lock on verify group
Yes, this is definitely the case.
We have ideas to implement something in the future to obviate this
problem, although we're still considering exactly how to implement it.
This is a short-term quick fix to solve some blocker issues that show up
as a consequence of it.

On 17 June 2016 at 17:57, Iustin Pop wrote:

> 2016-06-17 9:46 GMT-07:00 Federico Pareschi:
>
>> When a ganeti-watcher runs on the nodegroup, it submits the
>> verify-group job. If there is another job in the queue that is taking
>> some locks that stop the verify-group job (like an instance creation),
>> then the whole ganeti-watcher is blocked and has to wait for that job
>> to finish.
>>
>> We have a case where we need the ganeti-watcher's periodic check to
>> restart some downed instances, but the ganeti-watcher itself gets
>> stuck on some other jobs and those downed instances don't get brought
>> back up on time (and this causes problems for us).
>>
>> There really is no reason to hold a lock on the watcher state file
>> when submitting the verify disk job anyway.
>
> All this makes sense. My question is rather: if we end up with multiple
> watchers basically running (waiting) concurrently, can we end up with
> multiple (redundant) verify-group jobs?
>
> Sorry if I misunderstand the situation.
>
> iustin
>
>> On 17 June 2016 at 17:18, Iustin Pop wrote:
>>
>>> 2016-06-17 8:31 GMT-07:00 'Federico Morg Pareschi' via ganeti-devel
>>> <ganeti-devel@googlegroups.com>:
>>>
>>>> The ganeti-watcher holds the group file lock for too long, until
>>>> after the execution of a group-verify-disk job. The lock is held for
>>>> a long time if there are other jobs already running and blocking the
>>>> verify from executing. When the lock is held, another ganeti-watcher
>>>> run cannot be scheduled, so this prevents the ganeti-watcher from
>>>> running for several minutes.
>>>>
>>>> With this commit, the lock is released before running the
>>>> VerifyDisks operation, so even if the submitted job gets stuck in
>>>> the Job Queue, a subsequent ganeti-watcher run would still happen.
>>>
>>> Quick question: what prevents a runaway case where VerifyDisks is
>>> blocking for hours and we have many watchers all running and
>>> submitting their own VerifyDisks?
>>>
>>> thanks,
>>> iustin
Re: [PATCH stable-2.17] Prevent watcher from holding lock on verify group
2016-06-17 9:46 GMT-07:00 Federico Pareschi:

> When a ganeti-watcher runs on the nodegroup, it submits the
> verify-group job. If there is another job in the queue that is taking
> some locks that stop the verify-group job (like an instance creation),
> then the whole ganeti-watcher is blocked and has to wait for that job
> to finish.
>
> We have a case where we need the ganeti-watcher's periodic check to
> restart some downed instances, but the ganeti-watcher itself gets stuck
> on some other jobs and those downed instances don't get brought back up
> on time (and this causes problems for us).
>
> There really is no reason to hold a lock on the watcher state file when
> submitting the verify disk job anyway.

All this makes sense. My question is rather: if we end up with multiple
watchers basically running (waiting) concurrently, can we end up with
multiple (redundant) verify-group jobs?

Sorry if I misunderstand the situation.

iustin

> On 17 June 2016 at 17:18, Iustin Pop wrote:
>
>> 2016-06-17 8:31 GMT-07:00 'Federico Morg Pareschi' via ganeti-devel
>> <ganeti-devel@googlegroups.com>:
>>
>>> The ganeti-watcher holds the group file lock for too long, until
>>> after the execution of a group-verify-disk job. The lock is held for
>>> a long time if there are other jobs already running and blocking the
>>> verify from executing. When the lock is held, another ganeti-watcher
>>> run cannot be scheduled, so this prevents the ganeti-watcher from
>>> running for several minutes.
>>>
>>> With this commit, the lock is released before running the VerifyDisks
>>> operation, so even if the submitted job gets stuck in the Job Queue,
>>> a subsequent ganeti-watcher run would still happen.
>>
>> Quick question: what prevents a runaway case where VerifyDisks is
>> blocking for hours and we have many watchers all running and
>> submitting their own VerifyDisks?
>>
>> thanks,
>> iustin
Re: [PATCH stable-2.17] Prevent watcher from holding lock on verify group
When a ganeti-watcher runs on the nodegroup, it submits the verify-group
job. If there is another job in the queue that is taking some locks that
stop the verify-group job (like an instance creation), then the whole
ganeti-watcher is blocked and has to wait for that job to finish.

We have a case where we need the ganeti-watcher's periodic check to
restart some downed instances, but the ganeti-watcher itself gets stuck
on some other jobs and those downed instances don't get brought back up
on time (and this causes problems for us).

There really is no reason to hold a lock on the watcher state file when
submitting the verify disk job anyway.

On 17 June 2016 at 17:18, Iustin Pop wrote:

> 2016-06-17 8:31 GMT-07:00 'Federico Morg Pareschi' via ganeti-devel
> <ganeti-devel@googlegroups.com>:
>
>> The ganeti-watcher holds the group file lock for too long, until after
>> the execution of a group-verify-disk job. The lock is held for a long
>> time if there are other jobs already running and blocking the verify
>> from executing. When the lock is held, another ganeti-watcher run
>> cannot be scheduled, so this prevents the ganeti-watcher from running
>> for several minutes.
>>
>> With this commit, the lock is released before running the VerifyDisks
>> operation, so even if the submitted job gets stuck in the Job Queue, a
>> subsequent ganeti-watcher run would still happen.
>
> Quick question: what prevents a runaway case where VerifyDisks is
> blocking for hours and we have many watchers all running and submitting
> their own VerifyDisks?
>
> thanks,
> iustin
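To make the failure mode concrete: the watcher takes an exclusive,
non-blocking lock on the per-group state file, and a run that cannot get
the lock simply gives up. A stand-alone illustration of that idiom (the
general flock pattern, not the actual code in lib/watcher/state.py):

import errno
import fcntl

def OpenStateFileExclusive(path):
  """Open the watcher state file, or return None if it is already locked."""
  statefile = open(path, "a+")
  try:
    # Non-blocking exclusive lock: fail at once instead of queueing up
    # behind the current holder.
    fcntl.flock(statefile, fcntl.LOCK_EX | fcntl.LOCK_NB)
  except IOError, err:
    if err.errno in (errno.EACCES, errno.EAGAIN):
      # A previous watcher run still holds the lock -- which, before this
      # patch, could mean it was merely waiting on a stuck verify job.
      statefile.close()
      return None
    raise
  return statefile

So as long as one run sits inside the locked region waiting for the
verify-group job, every later run sees None here and exits, and the
downed instances stay down.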
Re: [PATCH stable-2.17] Prevent watcher from holding lock on verify group
2016-06-17 8:31 GMT-07:00 'Federico Morg Pareschi' via ganeti-devel
<ganeti-devel@googlegroups.com>:

> The ganeti-watcher holds the group file lock for too long, until after
> the execution of a group-verify-disk job. The lock is held for a long
> time if there are other jobs already running and blocking the verify
> from executing. When the lock is held, another ganeti-watcher run
> cannot be scheduled, so this prevents the ganeti-watcher from running
> for several minutes.
>
> With this commit, the lock is released before running the VerifyDisks
> operation, so even if the submitted job gets stuck in the Job Queue, a
> subsequent ganeti-watcher run would still happen.

Quick question: what prevents a runaway case where VerifyDisks is
blocking for hours and we have many watchers all running and submitting
their own VerifyDisks?

thanks,
iustin
[PATCH stable-2.17] Prevent watcher from holding lock on verify group
The ganeti-watcher holds the group file lock for too long, until after
the execution of a group-verify-disk job. The lock is held for a long
time if there are other jobs already running and blocking the verify
from executing. When the lock is held, another ganeti-watcher run cannot
be scheduled, so this prevents the ganeti-watcher from running for
several minutes.

With this commit, the lock is released before running the VerifyDisks
operation, so even if the submitted job gets stuck in the Job Queue, a
subsequent ganeti-watcher run would still happen.

As an additional change, an incorrect docstring was also removed.

Signed-off-by: Federico Morg Pareschi
---
 lib/watcher/__init__.py | 31 ++++++++++++++++---------------
 lib/watcher/state.py    |  2 +-
 2 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/lib/watcher/__init__.py b/lib/watcher/__init__.py
index 6b73dc0..4e946b3 100644
--- a/lib/watcher/__init__.py
+++ b/lib/watcher/__init__.py
@@ -844,7 +844,7 @@ def _GroupWatcher(opts):
 
   logging.debug("Using state file %s", state_path)
 
-  # Global watcher
+  # Group watcher file lock
   statefile = state.OpenStateFile(state_path)  # pylint: disable=E0602
   if not statefile:
     return constants.EXIT_FAILURE
@@ -867,26 +867,27 @@ def _GroupWatcher(opts):
 
     started = _CheckInstances(client, notepad, instances, locks)
     _CheckDisks(client, notepad, nodes, instances, started)
-
-    # Check if the nodegroup only has ext storage type
-    only_ext = compat.all(i.disk_template == constants.DT_EXT
-                          for i in instances.values())
-
-    # We skip current NodeGroup verification if there are only external storage
-    # devices. Currently we provide an interface for external storage provider
-    # for disk verification implementations, however current ExtStorageDevice
-    # does not provide an API for this yet.
-    #
-    # This check needs to be revisited if ES_ACTION_VERIFY on ExtStorageDevice
-    # is implemented.
-    if not opts.no_verify_disks and not only_ext:
-      _VerifyDisks(client, group_uuid, nodes, instances)
   except Exception, err:
     logging.info("Not updating status file due to failure: %s", err)
     raise
   else:
     # Save changes for next run
     notepad.Save(state_path)
+    notepad.Close()
+
+  # Check if the nodegroup only has ext storage type
+  only_ext = compat.all(i.disk_template == constants.DT_EXT
+                        for i in instances.values())
+
+  # We skip current NodeGroup verification if there are only external storage
+  # devices. Currently we provide an interface for external storage provider
+  # for disk verification implementations, however current ExtStorageDevice
+  # does not provide an API for this yet.
+  #
+  # This check needs to be revisited if ES_ACTION_VERIFY on ExtStorageDevice
+  # is implemented.
+  if not opts.no_verify_disks and not only_ext:
+    _VerifyDisks(client, group_uuid, nodes, instances)
 
   return constants.EXIT_SUCCESS
 
diff --git a/lib/watcher/state.py b/lib/watcher/state.py
index 5c51b5b..b8ff4ef 100644
--- a/lib/watcher/state.py
+++ b/lib/watcher/state.py
@@ -111,7 +111,7 @@ class WatcherState(object):
     self._orig_data = serializer.Dump(self._data)
 
   def Save(self, filename):
-    """Save state to file, then unlock and close it.
+    """Save state to file.
 
     """
     assert self.statefile
--
2.8.0.rc3.226.g39d4020
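Taken together, the patch turns the function into a two-phase cycle: all
state-file work happens under the lock, and the potentially slow verify
submission happens only after the lock is dropped. A minimal stand-alone
rendering of that shape, with hypothetical update_state and
submit_verify_job callables standing in for the real watcher steps:

import fcntl

def WatcherCycle(state_path, update_state, submit_verify_job):
  """Run one watcher cycle with the post-patch locking discipline."""
  statefile = open(state_path, "a+")
  try:
    # Phase 1: bookkeeping under the exclusive state-file lock.
    fcntl.flock(statefile, fcntl.LOCK_EX | fcntl.LOCK_NB)
    update_state(statefile)
  finally:
    # Closing the descriptor drops the flock, mirroring the explicit
    # notepad.Close() this patch adds right after notepad.Save().
    statefile.close()
  # Phase 2: the lock is no longer held, so even if this job sits in the
  # queue for hours, the next watcher cycle can still take the lock.
  submit_verify_job()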