Re: [PATCH stable-2.17] Prevent watcher from holding lock on verify group
On Fri, Jun 17, 2016 at 04:31:55PM +0100, 'Federico Morg Pareschi' via ganeti-devel wrote:
> The ganeti-watcher holds the group file lock for too long, until after
> the execution of a group-verify-disk job. The lock is held for a long
> time if there are other jobs already running and blocking the verify
> from executing. When the lock is held, another ganeti-watcher run
> cannot be scheduled, so this prevents the ganeti-watcher from running
> for several minutes.
>
> With this commit, the lock is released before running the VerifyDisks
> operation, so even if the submitted job gets stuck in the Job Queue, a
> subsequent ganeti-watcher run would still happen.
>
> As an additional change, an incorrect docstring was also removed.
>
> Signed-off-by: Federico Morg Pareschi

OK, thanks for the patch and the discussion, and sorry for the delay in
reviewing this. As you say, we'll have to figure out a good way to avoid
multiple redundant VerifyDisks jobs running at once, but we can worry
about that later, and for now this LGTM.

Brian.
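For the archives, one possible shape for that future fix is a submit-time
guard: before queueing a new verify, scan the job queue for an equivalent
job that is still pending. The sketch below is an illustration only, not
ganeti code: QueryJobs/SubmitJob follow the general LUXI client interface,
but the queried fields ("status", "ops"), the status constants, and the
opcode matching are assumptions.

from ganeti import constants, opcodes

# Job states in which an earlier verify still covers ours (assumed names).
_PENDING = frozenset([constants.JOB_STATUS_QUEUED,
                      constants.JOB_STATUS_WAITING,
                      constants.JOB_STATUS_RUNNING])

def _SubmitVerifyUnlessPending(client, group_uuid):
  """Submit a verify-disks job for a group unless one is already pending."""
  for status, ops in client.QueryJobs(None, ["status", "ops"]):
    if status not in _PENDING:
      continue
    for op in ops:
      # Serialized opcodes are assumed to carry their OP_ID and parameters.
      if (op.get("OP_ID") == "OP_CLUSTER_VERIFY_DISKS" and
          op.get("group_name") == group_uuid):
        return None  # an equivalent verify job is already on its way
  return client.SubmitJob([opcodes.OpClusterVerifyDisks(group_name=group_uuid)])

Note this still leaves a small check-then-submit race between two watchers
running concurrently, so a complete fix would probably have to deduplicate
inside the job queue itself or under a short-lived lock.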
Re: [PATCH stable-2.17] Prevent watcher from holding lock on verify group
On 17 June 2016 at 10:36, Federico Pareschi wrote:
> Yes, this is definitely the case.
> We have ideas to implement something in the future to obviate this
> problem, although we're still considering exactly how to implement it.
> This is a short-term quick fix to solve some blocker issues that show
> up as a consequence of it.

Ack, thanks for the info!

iustin

> On 17 June 2016 at 17:57, Iustin Pop wrote:
>
>> 2016-06-17 9:46 GMT-07:00 Federico Pareschi:
>>
>>> When a ganeti-watcher runs on the nodegroup, it submits the
>>> verify-group job. If there is another job in the queue that is taking
>>> some locks that stop the verify-group job (like an instance
>>> creation), then the whole ganeti-watcher is blocked and has to wait
>>> for that job to finish.
>>>
>>> We have a case where we need the ganeti-watcher's periodic check to
>>> restart some downed instances, but the ganeti-watcher itself gets
>>> stuck on some other jobs and those downed instances don't get brought
>>> back up on time (and this causes problems for us).
>>>
>>> There really is no reason to hold a lock on the watcher state file
>>> when submitting the verify disk job anyway.
>>
>> All this makes sense. My question is rather: if we end up with
>> multiple watchers basically running (waiting) concurrently, can we
>> end up with multiple (redundant) verify-group jobs?
>>
>> Sorry if I misunderstand the situation.
>>
>> iustin
>>
>>> On 17 June 2016 at 17:18, Iustin Pop wrote:
>>>
>>>> 2016-06-17 8:31 GMT-07:00 'Federico Morg Pareschi' via ganeti-devel
>>>> <ganeti-devel@googlegroups.com>:
>>>>
>>>>> The ganeti-watcher holds the group file lock for too long, until
>>>>> after the execution of a group-verify-disk job. The lock is held
>>>>> for a long time if there are other jobs already running and
>>>>> blocking the verify from executing. When the lock is held, another
>>>>> ganeti-watcher run cannot be scheduled, so this prevents the
>>>>> ganeti-watcher from running for several minutes.
>>>>>
>>>>> With this commit, the lock is released before running the
>>>>> VerifyDisks operation, so even if the submitted job gets stuck in
>>>>> the Job Queue, a subsequent ganeti-watcher run would still happen.
>>>>
>>>> Quick question: what prevents a runaway case where VerifyDisks is
>>>> blocking for hours and we have many watchers all running and
>>>> submitting their own VerifyDisks?
>>>>
>>>> thanks,
>>>> iustin
Re: [PATCH stable-2.17] Prevent watcher from holding lock on verify group
Yes, this is definitely the case.
We have ideas to implement something in the future to obviate this
problem, although we're still considering exactly how to implement it.
This is a short-term quick fix to solve some blocker issues that show up
as a consequence of it.

On 17 June 2016 at 17:57, Iustin Pop wrote:

> 2016-06-17 9:46 GMT-07:00 Federico Pareschi:
>
>> When a ganeti-watcher runs on the nodegroup, it submits the
>> verify-group job. If there is another job in the queue that is taking
>> some locks that stop the verify-group job (like an instance creation),
>> then the whole ganeti-watcher is blocked and has to wait for that job
>> to finish.
>>
>> We have a case where we need the ganeti-watcher's periodic check to
>> restart some downed instances, but the ganeti-watcher itself gets
>> stuck on some other jobs and those downed instances don't get brought
>> back up on time (and this causes problems for us).
>>
>> There really is no reason to hold a lock on the watcher state file
>> when submitting the verify disk job anyway.
>
> All this makes sense. My question is rather: if we end up with multiple
> watchers basically running (waiting) concurrently, can we end up with
> multiple (redundant) verify-group jobs?
>
> Sorry if I misunderstand the situation.
>
> iustin
>
>> On 17 June 2016 at 17:18, Iustin Pop wrote:
>>
>>> 2016-06-17 8:31 GMT-07:00 'Federico Morg Pareschi' via ganeti-devel
>>> <ganeti-devel@googlegroups.com>:
>>>
>>>> The ganeti-watcher holds the group file lock for too long, until
>>>> after the execution of a group-verify-disk job. The lock is held for
>>>> a long time if there are other jobs already running and blocking the
>>>> verify from executing. When the lock is held, another ganeti-watcher
>>>> run cannot be scheduled, so this prevents the ganeti-watcher from
>>>> running for several minutes.
>>>>
>>>> With this commit, the lock is released before running the
>>>> VerifyDisks operation, so even if the submitted job gets stuck in
>>>> the Job Queue, a subsequent ganeti-watcher run would still happen.
>>>
>>> Quick question: what prevents a runaway case where VerifyDisks is
>>> blocking for hours and we have many watchers all running and
>>> submitting their own VerifyDisks?
>>>
>>> thanks,
>>> iustin
Re: [PATCH stable-2.17] Prevent watcher from holding lock on verify group
2016-06-17 9:46 GMT-07:00 Federico Pareschi:

> When a ganeti-watcher runs on the nodegroup, it submits the
> verify-group job. If there is another job in the queue that is taking
> some locks that stop the verify-group job (like an instance creation),
> then the whole ganeti-watcher is blocked and has to wait for that job
> to finish.
>
> We have a case where we need the ganeti-watcher's periodic check to
> restart some downed instances, but the ganeti-watcher itself gets stuck
> on some other jobs and those downed instances don't get brought back up
> on time (and this causes problems for us).
>
> There really is no reason to hold a lock on the watcher state file when
> submitting the verify disk job anyway.

All this makes sense. My question is rather: if we end up with multiple
watchers basically running (waiting) concurrently, can we end up with
multiple (redundant) verify-group jobs?

Sorry if I misunderstand the situation.

iustin

> On 17 June 2016 at 17:18, Iustin Pop wrote:
>
>> 2016-06-17 8:31 GMT-07:00 'Federico Morg Pareschi' via ganeti-devel
>> <ganeti-devel@googlegroups.com>:
>>
>>> The ganeti-watcher holds the group file lock for too long, until
>>> after the execution of a group-verify-disk job. The lock is held for
>>> a long time if there are other jobs already running and blocking the
>>> verify from executing. When the lock is held, another ganeti-watcher
>>> run cannot be scheduled, so this prevents the ganeti-watcher from
>>> running for several minutes.
>>>
>>> With this commit, the lock is released before running the VerifyDisks
>>> operation, so even if the submitted job gets stuck in the Job Queue,
>>> a subsequent ganeti-watcher run would still happen.
>>
>> Quick question: what prevents a runaway case where VerifyDisks is
>> blocking for hours and we have many watchers all running and
>> submitting their own VerifyDisks?
>>
>> thanks,
>> iustin
Re: [PATCH stable-2.17] Prevent watcher from holding lock on verify group
When a ganeti-watcher runs on the nodegroup, it submits the verify-group
job. If there is another job in the queue that is taking some locks that
stop the verify-group job (like an instance creation), then the whole
ganeti-watcher is blocked and has to wait for that job to finish.

We have a case where we need the ganeti-watcher's periodic check to
restart some downed instances, but the ganeti-watcher itself gets stuck
on some other jobs and those downed instances don't get brought back up
on time (and this causes problems for us).

There really is no reason to hold a lock on the watcher state file when
submitting the verify disk job anyway.

On 17 June 2016 at 17:18, Iustin Pop wrote:

> 2016-06-17 8:31 GMT-07:00 'Federico Morg Pareschi' via ganeti-devel
> <ganeti-devel@googlegroups.com>:
>
>> The ganeti-watcher holds the group file lock for too long, until after
>> the execution of a group-verify-disk job. The lock is held for a long
>> time if there are other jobs already running and blocking the verify
>> from executing. When the lock is held, another ganeti-watcher run
>> cannot be scheduled, so this prevents the ganeti-watcher from running
>> for several minutes.
>>
>> With this commit, the lock is released before running the VerifyDisks
>> operation, so even if the submitted job gets stuck in the Job Queue, a
>> subsequent ganeti-watcher run would still happen.
>
> Quick question: what prevents a runaway case where VerifyDisks is
> blocking for hours and we have many watchers all running and submitting
> their own VerifyDisks?
>
> thanks,
> iustin
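To make the failure mode concrete: the watcher takes an exclusive,
non-blocking lock on the per-group state file, and a run that cannot get
the lock simply gives up. A stand-alone illustration of that idiom (the
general flock pattern, not the actual code in lib/watcher/state.py):

import errno
import fcntl

def OpenStateFileExclusive(path):
  """Open the watcher state file, or return None if it is already locked."""
  statefile = open(path, "a+")
  try:
    # Non-blocking exclusive lock: fail at once instead of queueing up
    # behind the current holder.
    fcntl.flock(statefile, fcntl.LOCK_EX | fcntl.LOCK_NB)
  except IOError, err:
    if err.errno in (errno.EACCES, errno.EAGAIN):
      # A previous watcher run still holds the lock -- which, before this
      # patch, could mean it was merely waiting on a stuck verify job.
      statefile.close()
      return None
    raise
  return statefile

So as long as one run sits inside the locked region waiting for the
verify-group job, every later run sees None here and exits, and the
downed instances stay down.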
Re: [PATCH stable-2.17] Prevent watcher from holding lock on verify group
2016-06-17 8:31 GMT-07:00 'Federico Morg Pareschi' via ganeti-devel
<ganeti-devel@googlegroups.com>:

> The ganeti-watcher holds the group file lock for too long, until after
> the execution of a group-verify-disk job. The lock is held for a long
> time if there are other jobs already running and blocking the verify
> from executing. When the lock is held, another ganeti-watcher run
> cannot be scheduled, so this prevents the ganeti-watcher from running
> for several minutes.
>
> With this commit, the lock is released before running the VerifyDisks
> operation, so even if the submitted job gets stuck in the Job Queue, a
> subsequent ganeti-watcher run would still happen.

Quick question: what prevents a runaway case where VerifyDisks is
blocking for hours and we have many watchers all running and submitting
their own VerifyDisks?

thanks,
iustin
[PATCH stable-2.17] Prevent watcher from holding lock on verify group
The ganeti-watcher holds the group file lock for too long, until after
the execution of a group-verify-disk job. The lock is held for a long
time if there are other jobs already running and blocking the verify
from executing. When the lock is held, another ganeti-watcher run cannot
be scheduled, so this prevents the ganeti-watcher from running for
several minutes.

With this commit, the lock is released before running the VerifyDisks
operation, so even if the submitted job gets stuck in the Job Queue, a
subsequent ganeti-watcher run would still happen.

As an additional change, an incorrect docstring was also removed.

Signed-off-by: Federico Morg Pareschi
---
 lib/watcher/__init__.py | 31 ++++++++++++++++---------------
 lib/watcher/state.py    |  2 +-
 2 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/lib/watcher/__init__.py b/lib/watcher/__init__.py
index 6b73dc0..4e946b3 100644
--- a/lib/watcher/__init__.py
+++ b/lib/watcher/__init__.py
@@ -844,7 +844,7 @@ def _GroupWatcher(opts):
 
   logging.debug("Using state file %s", state_path)
 
-  # Global watcher
+  # Group watcher file lock
   statefile = state.OpenStateFile(state_path)  # pylint: disable=E0602
   if not statefile:
     return constants.EXIT_FAILURE
@@ -867,26 +867,27 @@ def _GroupWatcher(opts):
 
     started = _CheckInstances(client, notepad, instances, locks)
     _CheckDisks(client, notepad, nodes, instances, started)
-
-    # Check if the nodegroup only has ext storage type
-    only_ext = compat.all(i.disk_template == constants.DT_EXT
-                          for i in instances.values())
-
-    # We skip current NodeGroup verification if there are only external storage
-    # devices. Currently we provide an interface for external storage provider
-    # for disk verification implementations, however current ExtStorageDevice
-    # does not provide an API for this yet.
-    #
-    # This check needs to be revisited if ES_ACTION_VERIFY on ExtStorageDevice
-    # is implemented.
-    if not opts.no_verify_disks and not only_ext:
-      _VerifyDisks(client, group_uuid, nodes, instances)
   except Exception, err:
     logging.info("Not updating status file due to failure: %s", err)
     raise
   else:
     # Save changes for next run
     notepad.Save(state_path)
+    notepad.Close()
+
+  # Check if the nodegroup only has ext storage type
+  only_ext = compat.all(i.disk_template == constants.DT_EXT
+                        for i in instances.values())
+
+  # We skip current NodeGroup verification if there are only external storage
+  # devices. Currently we provide an interface for external storage provider
+  # for disk verification implementations, however current ExtStorageDevice
+  # does not provide an API for this yet.
+  #
+  # This check needs to be revisited if ES_ACTION_VERIFY on ExtStorageDevice
+  # is implemented.
+  if not opts.no_verify_disks and not only_ext:
+    _VerifyDisks(client, group_uuid, nodes, instances)
 
   return constants.EXIT_SUCCESS
 
diff --git a/lib/watcher/state.py b/lib/watcher/state.py
index 5c51b5b..b8ff4ef 100644
--- a/lib/watcher/state.py
+++ b/lib/watcher/state.py
@@ -111,7 +111,7 @@ class WatcherState(object):
     self._orig_data = serializer.Dump(self._data)
 
   def Save(self, filename):
-    """Save state to file, then unlock and close it.
+    """Save state to file.
 
     """
     assert self.statefile
--
2.8.0.rc3.226.g39d4020
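Taken together, the patch turns the function into a two-phase cycle: all
state-file work happens under the lock, and the potentially slow verify
submission happens only after the lock is dropped. A minimal stand-alone
rendering of that shape, with hypothetical update_state and
submit_verify_job callables standing in for the real watcher steps:

import fcntl

def WatcherCycle(state_path, update_state, submit_verify_job):
  """Run one watcher cycle with the post-patch locking discipline."""
  statefile = open(state_path, "a+")
  try:
    # Phase 1: bookkeeping under the exclusive state-file lock.
    fcntl.flock(statefile, fcntl.LOCK_EX | fcntl.LOCK_NB)
    update_state(statefile)
  finally:
    # Closing the descriptor drops the flock, mirroring the explicit
    # notepad.Close() this patch adds right after notepad.Save().
    statefile.close()
  # Phase 2: the lock is no longer held, so even if this job sits in the
  # queue for hours, the next watcher cycle can still take the lock.
  submit_verify_job()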