Re: [PATCH stable-2.16] Special case WaitForJobChange to reduce heap use

2016-06-17 Thread Iustin Pop
On 2016-06-15 17:25:18, Ganeti Development List wrote:
> As we talked offline, I like the fact that we are not going through the
> generic handleCall if we do not need the config regardless of the
> optimisation (which is awesome and is the main motivation for this patch).
> I'm wondering if we still need to keep the expensive code path - I don't
> see a reason why someone would want to query other fields in WaitForJobChange
> than status or logs (I think these are the only ones that change during the
> lifetime of the job). Maybe we could drop the expensive code path in later
> versions.

I discussed this with Brian briefly over chat a while ago. I regard this
as a good but temporary stop-gap, since we shouldn't have to route these
calls manually. Better type-system separation should allow the config to
be released automatically when it is not needed. It would be good to
invest in that, rather than (IMHO) expanding the manual workaround.

Just my 2c :)

iustin


[PATCH stable-2.15] KVM: handle gracefully too old/too new psutil versions

2016-06-17 Thread Iustin Pop
From: Iustin Pop 

My previous pylint cleanups were done without psutil installed; as soon
as I installed it, pylint showed that wheezy's version of psutil is too
old (0.5.1), lacking the cpu_count() function which was introduced in
2.0.0. Furthermore, thanks to Brian, it turns out that too-new versions
are also unsupported.

This change adds a simple way to detect this and to degrade gracefully
by throwing an appropriate exception instead of an unclear one at
runtime. Tested with wheezy's 0.5.1 and sid's 4.1.1 versions. It also
updates the INSTALL file to note the supported versions and that
[2.0.0,2.2.0) had issues with FH leaks.

Signed-off-by: Iustin Pop 
---
 INSTALL   |  4 +++-
 lib/hypervisor/hv_kvm/__init__.py | 17 +++--
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/INSTALL b/INSTALL
index e33ab96..95f3019 100644
--- a/INSTALL
+++ b/INSTALL
@@ -42,7 +42,9 @@ Before installing, please verify that you have the following programs:
 - `Paramiko `_, if you want to use
   ``ganeti-listrunner``
 - `psutil Python module `_,
-  optional python package for supporting CPU pinning under KVM
+  optional Python package for supporting CPU pinning under KVM, versions
+  2.x.x only; beware that versions from 2.0.0 to before 2.2.0 had a
+  number of file handle leaks, so running at least 2.2.0 is advised
 - `fdsend Python module `_,
   optional Python package for supporting NIC hotplugging under KVM
 - `qemu-img `_, if you want to use ``ovfconverter``
diff --git a/lib/hypervisor/hv_kvm/__init__.py b/lib/hypervisor/hv_kvm/__init__.py
index 4df0246..8271e0e 100644
--- a/lib/hypervisor/hv_kvm/__init__.py
+++ b/lib/hypervisor/hv_kvm/__init__.py
@@ -45,7 +45,17 @@ import urllib2
 from bitarray import bitarray
 try:
   import psutil   # pylint: disable=F0401
+  if psutil.version_info < (2, 0, 0):
+    # The psutil version seems too old, we ignore it
+    psutil_err = "too old (2.x.x needed, %s found)" % psutil.__version__
+    psutil = None
+  elif psutil.version_info >= (3,):
+    psutil_err = "too new (2.x.x needed, %s found)" % psutil.__version__
+    psutil = None
+  else:
+    psutil_err = ""
 except ImportError:
+  psutil_err = "not found"
   psutil = None
 try:
   import fdsend   # pylint: disable=F0401
@@ -716,11 +726,14 @@ class KVMHypervisor(hv_base.BaseHypervisor):
 
    """
    if psutil is None:
-      raise errors.HypervisorError("psutil Python package not"
-                                   " found; cannot use CPU pinning under KVM")
+      raise errors.HypervisorError("psutil Python package %s"
+                                   "; cannot use CPU pinning"
+                                   " under KVM" % psutil_err)
 
     target_process = psutil.Process(process_id)
     if cpus == constants.CPU_PINNING_OFF:
+      # we checked this at import time
+      # pylint: disable=E1101
       target_process.set_cpu_affinity(range(psutil.cpu_count()))
     else:
       target_process.set_cpu_affinity(cpus)
-- 
2.8.1



Re: [PATCH stable-2.15 1/2] KVM: handle gracefully too old psutil versions

2016-06-17 Thread Iustin Pop
On 2016-06-15 10:23:57, Brian Foley wrote:
> On Wed, Jun 15, 2016 at 07:18:38AM +0200, Iustin Pop wrote:
> > From: Iustin Pop 
> > 
> > My previous pylint cleanups were done without psutil installed; as soon
> > as I installed it, pylint showed that wheezy's version of psutil is too
> > old (0.5.1), lacking the cpu_count() function which was introduced in
> > 2.0.0.
> > 
> > This change adds a simple way to detect this and to degrade gracefully
> > by throwing an appropriate exception instead of an unclear one at
> > runtime.
> 
> This is good, but we could take it a little further. First of all, we
> support any 2.x version of psutil, from 2.0.0 to 2.2.1. However, we
> probably want to warn if the user is running <2.2.0, because 2.2.0
> fixed a number of file handle leaks.

Uh, that is hard to do properly; I don't know whether many people watch
configure-time warnings… I'll add a note to INSTALL.

> Additionally, 0.5.0 had a psutil.Process.{get,set}_cpu_affinity() API,
> which we use in the kvm code. In 2.0.0 the API was renamed to
> cpu_affinity() and the old API deprecated, and in 3.0 the old {get,set}
> API was removed.
> 
> So we currently need to restrict psutil to 2.x, but maybe it's just
> easier to add an __attr__ check for the new affinity API; perhaps
> this should be done in a separate patch.

Just to see if I understand: we don't support 3.x+ solely because of the
cpu_affinity API change?
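For illustration, the attribute check Brian suggests could look roughly like this (a sketch with made-up names, not code from any patch; real code would pass psutil.Process):

```python
def check_affinity_api(proc_cls):
  """Return an error message if the given psutil Process class lacks
  the cpu_affinity() API, or the empty string if it is usable."""
  # The kvm code currently calls the old set_cpu_affinity() name, which
  # psutil 3.0 removed; checking for (and switching to) the new
  # cpu_affinity() name, introduced in 2.0.0, would lift the upper
  # version bound instead of pinning to 2.x.
  if hasattr(proc_cls, "cpu_affinity"):
    return ""
  return "psutil Process class lacks cpu_affinity()"

# Stand-in classes for demonstration only:
class ModernProcess(object):
  # looks like psutil >= 2.0.0
  def cpu_affinity(self, cpus=None):
    pass

class AncientProcess(object):
  # looks like psutil 0.x
  def set_cpu_affinity(self, cpus):
    pass
```

This would accept any psutil exposing the new API, including future 3.x releases, at the cost of an attribute check instead of an explicit version window.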

Patch coming shortly.

iustin


Re: [PATCH stable-2.17] Prevent watcher from holding lock on verify group

2016-06-17 Thread 'Iustin Pop' via ganeti-devel
On 17 June 2016 at 10:36, Federico Pareschi  wrote:

> Yes, this is definitely the case.
> We have ideas to implement something in the future to obviate this
> problem, although we're still considering exactly how to implement it.
> This is a short-term quick fix to solve some blocker issues that show up
> as a consequence of it.
>

Ack, thanks for the info!

iustin


> On 17 June 2016 at 17:57, Iustin Pop  wrote:
>
>> 2016-06-17 9:46 GMT-07:00 Federico Pareschi :
>>
>>> When a ganeti-watcher runs on the nodegroup, it submits the verify-group
>>> job. If there is another job in the queue that is taking some locks that
>>> stop the verify-group job (like an instance creation) then the whole
>>> ganeti-watcher is blocked and has to wait for that job to finish.
>>>
>>> We have a case where we need the ganeti-watcher's periodic check to
>>> restart some downed instances, but the ganeti-watcher itself gets stuck on
>>> some other jobs and those downed instances don't get brought back up in
>>> time (and this causes problems for us).
>>>
>>> There really is no reason to hold a lock on the watcher state file when
>>> submitting the verify disk job anyway.
>>>
>>
>> All this makes sense. My question is rather, if we end up with multiple
>> watchers basically running (waiting) concurrently, can we end up with
>> multiple (redundant) verify-group jobs?
>>
>> Sorry if I misunderstand the situation.
>>
>> iustin
>>
>> On 17 June 2016 at 17:18, Iustin Pop  wrote:
>>>
 2016-06-17 8:31 GMT-07:00 'Federico Morg Pareschi' via ganeti-devel <
 ganeti-devel@googlegroups.com>:

> The ganeti-watcher holds the group file lock for too long, until after
> the execution of a group-verify-disk job. This can take a long time if
> there are other jobs already running and blocking the verify from
> executing. While the lock is held, another ganeti-watcher run cannot be
> scheduled, so this can prevent the ganeti-watcher from running for several
> minutes.
>
> With this commit, the lock is released before running the VerifyDisks
> operation, so even if the submitted job gets stuck in the job queue, a
> subsequent ganeti-watcher run would still happen.
>

 Quick question: what prevents a runaway case where VerifyDisks is
 blocking for hours and we have many watchers all running and submitting
 their own VerifyDisks?

 thanks,
 iustin

>>>
>>>
>>
>


Re: [PATCH stable-2.17] Prevent watcher from holding lock on verify group

2016-06-17 Thread 'Federico Pareschi' via ganeti-devel
Yes, this is definitely the case.
We have ideas to implement something in the future to obviate this problem,
although we're still considering exactly how to implement it.
This is a short-term quick fix to solve some blocker issues that show up as
a consequence of it.

On 17 June 2016 at 17:57, Iustin Pop  wrote:

> 2016-06-17 9:46 GMT-07:00 Federico Pareschi :
>
>> When a ganeti-watcher runs on the nodegroup, it submits the verify-group
>> job. If there is another job in the queue that is taking some locks that
>> stop the verify-group job (like an instance creation) then the whole
>> ganeti-watcher is blocked and has to wait for that job to finish.
>>
>> We have a case where we need the ganeti-watcher's periodic check to
>> restart some downed instances, but the ganeti-watcher itself gets stuck on
>> some other jobs and those downed instances don't get brought back up in
>> time (and this causes problems for us).
>>
>> There really is no reason to hold a lock on the watcher state file when
>> submitting the verify disk job anyway.
>>
>
> All this makes sense. My question is rather, if we end up with multiple
> watchers basically running (waiting) concurrently, can we end up with
> multiple (redundant) verify-group jobs?
>
> Sorry if I misunderstand the situation.
>
> iustin
>
> On 17 June 2016 at 17:18, Iustin Pop  wrote:
>>
>>> 2016-06-17 8:31 GMT-07:00 'Federico Morg Pareschi' via ganeti-devel <
>>> ganeti-devel@googlegroups.com>:
>>>
 The ganeti-watcher holds the group file lock for too long, until after
 the execution of a group-verify-disk job. This can take a long time if
 there are other jobs already running and blocking the verify from
 executing. While the lock is held, another ganeti-watcher run cannot be
 scheduled, so this can prevent the ganeti-watcher from running for several
 minutes.

 With this commit, the lock is released before running the VerifyDisks
 operation, so even if the submitted job gets stuck in the job queue, a
 subsequent ganeti-watcher run would still happen.

>>>
>>> Quick question: what prevents a runaway case where VerifyDisks is
>>> blocking for hours and we have many watchers all running and submitting
>>> their own VerifyDisks?
>>>
>>> thanks,
>>> iustin
>>>
>>
>>
>


Re: [PATCH stable-2.17] Prevent watcher from holding lock on verify group

2016-06-17 Thread 'Iustin Pop' via ganeti-devel
2016-06-17 9:46 GMT-07:00 Federico Pareschi :

> When a ganeti-watcher runs on the nodegroup, it submits the verify-group
> job. If there is another job in the queue that is taking some locks that
> stop the verify-group job (like an instance creation) then the whole
> ganeti-watcher is blocked and has to wait for that job to finish.
>
> We have a case where we need the ganeti-watcher's periodic check to
> restart some downed instances, but the ganeti-watcher itself gets stuck on
> some other jobs and those downed instances don't get brought back up in
> time (and this causes problems for us).
>
> There really is no reason to hold a lock on the watcher state file when
> submitting the verify disk job anyway.
>

All this makes sense. My question is rather, if we end up with multiple
watchers basically running (waiting) concurrently, can we end up with
multiple (redundant) verify-group jobs?

Sorry if I misunderstand the situation.

iustin

On 17 June 2016 at 17:18, Iustin Pop  wrote:
>
>> 2016-06-17 8:31 GMT-07:00 'Federico Morg Pareschi' via ganeti-devel <
>> ganeti-devel@googlegroups.com>:
>>
>>> The ganeti-watcher holds the group file lock for too long, until after
>>> the execution of a group-verify-disk job. This can take a long time if
>>> there are other jobs already running and blocking the verify from
>>> executing. While the lock is held, another ganeti-watcher run cannot be
>>> scheduled, so this can prevent the ganeti-watcher from running for several
>>> minutes.
>>>
>>> With this commit, the lock is released before running the VerifyDisks
>>> operation, so even if the submitted job gets stuck in the job queue, a
>>> subsequent ganeti-watcher run would still happen.
>>>
>>
>> Quick question: what prevents a runaway case where VerifyDisks is
>> blocking for hours and we have many watchers all running and submitting
>> their own VerifyDisks?
>>
>> thanks,
>> iustin
>>
>
>


Re: [PATCH stable-2.17] Prevent watcher from holding lock on verify group

2016-06-17 Thread 'Federico Pareschi' via ganeti-devel
When a ganeti-watcher runs on the nodegroup, it submits the verify-group
job. If there is another job in the queue that is taking some locks that
stop the verify-group job (like an instance creation) then the whole
ganeti-watcher is blocked and has to wait for that job to finish.

We have a case where we need the ganeti-watcher's periodic check to restart
some downed instances, but the ganeti-watcher itself gets stuck on some
other jobs and those downed instances don't get brought back up in time
(and this causes problems for us).

There really is no reason to hold a lock on the watcher state file when
submitting the verify disk job anyway.
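The pattern being described here (update the watcher state under the lock, submit the possibly long-running verify job only after releasing it) can be sketched as follows; the function and callback names are illustrative, not the actual watcher API:

```python
import fcntl

def run_watcher_round(statefile, save_state, submit_verify_job):
  """One watcher round: update state under the group lock, then submit
  the verify-disks job only after the lock has been released."""
  fcntl.flock(statefile.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
  try:
    save_state(statefile)
  finally:
    # Release before submitting: even if the verify job sits in the job
    # queue for a long time, the next watcher run can still take the
    # lock and e.g. restart downed instances.
    fcntl.flock(statefile.fileno(), fcntl.LOCK_UN)
  submit_verify_job()
```

The trade-off, as raised elsewhere in the thread, is that several concurrent watcher runs may each submit their own verify job once the lock no longer serializes them.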

On 17 June 2016 at 17:18, Iustin Pop  wrote:

> 2016-06-17 8:31 GMT-07:00 'Federico Morg Pareschi' via ganeti-devel <
> ganeti-devel@googlegroups.com>:
>
>> The ganeti-watcher holds the group file lock for too long, until after
>> the execution of a group-verify-disk job. This can take a long time if
>> there are other jobs already running and blocking the verify from
>> executing. While the lock is held, another ganeti-watcher run cannot be
>> scheduled, so this can prevent the ganeti-watcher from running for several
>> minutes.
>>
>> With this commit, the lock is released before running the VerifyDisks
>> operation, so even if the submitted job gets stuck in the job queue, a
>> subsequent ganeti-watcher run would still happen.
>>
>
> Quick question: what prevents a runaway case where VerifyDisks is blocking
> for hours and we have many watchers all running and submitting their own
> VerifyDisks?
>
> thanks,
> iustin
>


Re: [PATCH stable-2.17] Prevent watcher from holding lock on verify group

2016-06-17 Thread 'Iustin Pop' via ganeti-devel
2016-06-17 8:31 GMT-07:00 'Federico Morg Pareschi' via ganeti-devel <
ganeti-devel@googlegroups.com>:

> The ganeti-watcher holds the group file lock for too long, until after
> the execution of a group-verify-disk job. This can take a long time if
> there are other jobs already running and blocking the verify from
> executing. While the lock is held, another ganeti-watcher run cannot be
> scheduled, so this can prevent the ganeti-watcher from running for several
> minutes.
>
> With this commit, the lock is released before running the VerifyDisks
> operation, so even if the submitted job gets stuck in the job queue, a
> subsequent ganeti-watcher run would still happen.
>

Quick question: what prevents a runaway case where VerifyDisks is blocking
for hours and we have many watchers all running and submitting their own
VerifyDisks?

thanks,
iustin


[PATCH stable-2.17] Prevent watcher from holding lock on verify group

2016-06-17 Thread 'Federico Morg Pareschi' via ganeti-devel
The ganeti-watcher holds the group file lock for too long, until after
the execution of a group-verify-disk job. This can take a long time if
there are other jobs already running and blocking the verify from
executing. While the lock is held, another ganeti-watcher run cannot be
scheduled, so this can prevent the ganeti-watcher from running for several
minutes.

With this commit, the lock is released before running the VerifyDisks
operation, so even if the submitted job gets stuck in the job queue, a
subsequent ganeti-watcher run would still happen.

As an additional change, an incorrect docstring was also fixed.

Signed-off-by: Federico Morg Pareschi 
---
 lib/watcher/__init__.py | 31 ---
 lib/watcher/state.py|  2 +-
 2 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/lib/watcher/__init__.py b/lib/watcher/__init__.py
index 6b73dc0..4e946b3 100644
--- a/lib/watcher/__init__.py
+++ b/lib/watcher/__init__.py
@@ -844,7 +844,7 @@ def _GroupWatcher(opts):
 
   logging.debug("Using state file %s", state_path)
 
-  # Global watcher
+  # Group watcher file lock
   statefile = state.OpenStateFile(state_path) # pylint: disable=E0602
   if not statefile:
     return constants.EXIT_FAILURE
@@ -867,26 +867,27 @@ def _GroupWatcher(opts):
 
     started = _CheckInstances(client, notepad, instances, locks)
     _CheckDisks(client, notepad, nodes, instances, started)
-
-    # Check if the nodegroup only has ext storage type
-    only_ext = compat.all(i.disk_template == constants.DT_EXT
-                          for i in instances.values())
-
-    # We skip current NodeGroup verification if there are only external storage
-    # devices. Currently we provide an interface for external storage provider
-    # for disk verification implementations, however current ExtStorageDevice
-    # does not provide an API for this yet.
-    #
-    # This check needs to be revisited if ES_ACTION_VERIFY on ExtStorageDevice
-    # is implemented.
-    if not opts.no_verify_disks and not only_ext:
-      _VerifyDisks(client, group_uuid, nodes, instances)
   except Exception, err:
     logging.info("Not updating status file due to failure: %s", err)
     raise
   else:
     # Save changes for next run
     notepad.Save(state_path)
+    notepad.Close()
+
+  # Check if the nodegroup only has ext storage type
+  only_ext = compat.all(i.disk_template == constants.DT_EXT
+                        for i in instances.values())
+
+  # We skip current NodeGroup verification if there are only external storage
+  # devices. Currently we provide an interface for external storage provider
+  # for disk verification implementations, however current ExtStorageDevice
+  # does not provide an API for this yet.
+  #
+  # This check needs to be revisited if ES_ACTION_VERIFY on ExtStorageDevice
+  # is implemented.
+  if not opts.no_verify_disks and not only_ext:
+    _VerifyDisks(client, group_uuid, nodes, instances)
 
   return constants.EXIT_SUCCESS
 
diff --git a/lib/watcher/state.py b/lib/watcher/state.py
index 5c51b5b..b8ff4ef 100644
--- a/lib/watcher/state.py
+++ b/lib/watcher/state.py
@@ -111,7 +111,7 @@ class WatcherState(object):
     self._orig_data = serializer.Dump(self._data)
 
   def Save(self, filename):
-    """Save state to file, then unlock and close it.
+    """Save state to file.
 
     """
     assert self.statefile
-- 
2.8.0.rc3.226.g39d4020