Re: [ovs-dev] [PATCH v1] ovs-tcpdump: Cleanup mirror failed with twice fatal signals

Ilya Maximets Thu, 22 Feb 2024 17:35:38 -0800

On 2/22/24 14:12, Daniel Ding wrote:
> 
> 
> 
>> 2024年2月22日 下午7:20，Ilya Maximets <[email protected] 
>> <mailto:[email protected]>> 写道：
>>
>> On 2/22/24 11:53, Ilya Maximets wrote:
>>> On 2/22/24 11:02, Daniel Ding wrote:
>>>>
>>>>
>>>>> 2024年2月22日 上午1:55，Ilya Maximets <[email protected] 
>>>>> <mailto:[email protected]> <mailto:[email protected] 
>>>>> <mailto:[email protected]>>> 写道：
>>>>>
>>>>> On 2/21/24 15:27, Aaron Conole wrote:
>>>>>> Daniel Ding <[email protected] <mailto:[email protected]> 
>>>>>> <mailto:[email protected] <mailto:[email protected]>>> writes:
>>>>>>
>>>>>>> After running ovs-tcpdump and inputs multiple CTRL+C, the program will
>>>>>>> raise the following exception.
>>>>>>>
>>>>>>> Error in atexit._run_exitfuncs:
>>>>>>> Traceback (most recent call last):
>>>>>>>  File "/usr/bin/ovs-tcpdump", line 421, in cleanup_mirror
>>>>>>>    ovsdb = OVSDB(db_sock)
>>>>>>>  File "/usr/bin/ovs-tcpdump", line 168, in __init__
>>>>>>>    OVSDB.wait_for_db_change(self._idl_conn)  # Initial Sync with DB
>>>>>>>  File "/usr/bin/ovs-tcpdump", line 155, in wait_for_db_change
>>>>>>>    while idl.change_seqno == seq and not idl.run():
>>>>>>>
>>>>>>> Signed-off-by: Daniel Ding <[email protected] 
>>>>>>> <mailto:[email protected]> <mailto:[email protected] 
>>>>>>> <mailto:[email protected]>>>
>>>>>>> ---
>>>>>>
>>>>>> LGTM for the linux side - maybe Alin might check the windows side.
>>>>>
>>>>> The code is copied from the fatal-signla library, so it is likely fine.
>>>>> However, I don;t think this issue should be fixed on the application
>>>>> level (ovs-tcpdump).   There seems to be adifference in how C and python
>>>>> fatal-signla libraries behave.  The C version calls the hooks in the run()
>>>>> function and the signal handler only updates the signal number variable.
>>>>> So, if the same signal arrives multiple times it will be handled only once
>>>>> and the run() function will not exit until all the hooks are done, unless
>>>>> it is a segfault.
>>>>>
>>>>> With python implementation we're calling hooks right from the signal
>>>>> handler (it's not a real signal handler, so we're allowed to use not
>>>>> really singal-safe functions there).  So, if another signal arrives while
>>>>> we're executing the hooks, the second hook execution will be skipped
>>>>> due to 'recursion' check, but the signal will be re-raised immediately,
>>>>> killing the process.
>>>>
>>>> As your suggestions, I rewrite recurse as a threading.Lock. So that 
>>>> protect calling hooks
>>>> applications registered.
>>>>
>>>>>
>>>>> There is likley a way to fix that by using a mutex instead of recursion
>>>>> check and blocking it while executing the hooks.  The signal number
>>>>> will need to be stored in the signal handler and the signal will need
>>>>> to be re-raised in the call_hooks() instead of signal handler.
>>>>> We can use mutex.acquire(blocking=False) as a way to prevent recursion.
>>>>>
>>>>
>>>> To avoid breaking calling hooks, I set the signal restored in the signal 
>>>> handler to SIG_IGN in 
>>>> call_hooks.
>>>
>>> I'm not sure this is necessary, see below.
> 
> The default handler of SIGINT is default_int_handler, so it not register the 
> signal handler
> successfully. When received CTRL+C again, the program be break, and calling 
> hook can't
> execute completely.
> 
>>>
>>>> And move re-raised the signal in the call_hooks with locked.
>>>>
>>>>> Does that make sense?
>>>>>
>>>>> Best regards, Ilya Maximets.
>>>>
>>>> diff --git a/python/ovs/fatal_signal.py b/python/ovs/fatal_signal.py
>>>> index cb2e99e87..3aa56a166 100644
>>>> --- a/python/ovs/fatal_signal.py
>>>> +++ b/python/ovs/fatal_signal.py
>>>> @@ -16,6 +16,7 @@ import atexit
>>>>  import os
>>>>  import signal
>>>>  import sys
>>>> +import threading
>>>>
>>>>  import ovs.vlog
>>>>
>>>> @@ -112,28 +113,34 @@ def _unlink(file_):
>>>>  def _signal_handler(signr, _):
>>>>      _call_hooks(signr)
>>>>
>>>> -    # Re-raise the signal with the default handling so that the program
>>>> -    # termination status reflects that we were killed by this signal.
>>>> -    signal.signal(signr, signal.SIG_DFL)
>>>> -    os.kill(os.getpid(), signr)
>>>> -
>>>>
>>>>  def _atexit_handler():
>>>>      _call_hooks(0)
>>>>
>>>>
>>>> -recurse = False
>>>> +recurse = threading.Lock()
>>>
>>> It's better to rename this to just 'mutex' or 'lock'.
> 
> Done
> 
>>>
>>>>
>>>>
>>>>  def _call_hooks(signr):
>>>>      global recurse
>>>> -    if recurse:
>>>> +    if recurse.locked():
>>>>          return
>>>
>>> There is an awkward race window between checking that the mutex is
>>> locked and actually attempting to take it below.  It's probably not
>>> a big deal here, but it's better to take the lock at the same time
>>> with checking, e.g.:
>>>
>>>    if not mutex.acquire(blocking=False):
>>>        return
>>>
>>> And since we're going to end the process anyway (it's a handler
>>> for fatal signals after all), we don't actually need to unlock it,
>>> AFAICT.
> 
> Agree it.
> 
>>>
>>>> -    recurse = True
>>>>
>>>> -    for hook, cancel, run_at_exit in _hooks:
>>>> -        if signr != 0 or run_at_exit:
>>>> -            hook()
>>>> +    with recurse:
>>>> +        if signr != 0:
>>>> +            signal.signal(signr, signal.SIG_IGN)
>>>> +        else:
>>>> +            signal.signal(signal.SIGINT, signal.SIG_IGN)
>>>
>>> If another signal arrives while we're under the lock, the signal
>>> handler will call the _call_hooks(), check that the lock is already
>>> taken and return.  There is no need to ignore it.
>>> Also, if a different signal arrives the code above will not ignore
>>> it anyway.
>>>
> 
> You’re right, but the SIGINT is exception in the _init because its default 
> handler is not SIG_DFL.
> So I append a new condition for SIGINT to register the signal handler.
> 
>         for signr in signals:
>             handler = signal.getsignal(signr)
>             if (handler == signal.SIG_DFL or
>                 handler == signal.default_int_handler):
>                 signal.signal(signr, _signal_handler)


Hmm, yeah, makes sense.

With that, do we need to catch the KeyboardInterrupt exception
in the ovs-tcpdump now?

> 
> 
>>>> +
>>>> +        for hook, cancel, run_at_exit in _hooks:
>>>> +            if signr != 0 or run_at_exit:
>>>> +                hook()
>>>> +
>>>> +        if signr != 0:
>>>> +           # Re-raise the signal with the default handling so that the 
>>>> program
>>>> +           # termination status reflects that we were killed by this 
>>>> signal.
>>>> +           signal.signal(signr, signal.SIG_DFL)
>>>> +           os.kill(os.getpid(), signr)
>>>
>>> We may want to place this in the try: finally: block, so we still
>>> re-raise the signal even if one of the hooks throws an exception.
>>> Hooks should not generally throw exceptions, but may be safer this
>>> way.
>>
>> On a second thought, if we re-raise after the exception in the
>> hook, we may loose the exception.  So, letting the exception to
>> kill the application may be better in this case then hiding it
>> with re-raising a signal.
>>
>>>
>>> What do you think?
>>>
>>>>
>>>>
>>>>  _inited = False
>>>> @@ -165,7 +172,6 @@ def signal_alarm(timeout):
>>>>
>>>>      if sys.platform == "win32":
>>>>          import time
>>>> -        import threading
>>>>
>>>>          class Alarm (threading.Thread):
>>>>              def __init__(self, timeout):
>>>> --
>>>> 2.43.0
>>>>
>>>> Thanks Ilya Maximets, please help me review again.
>>>>
>>>> Best regards, Daniel Ding.
> 
> 
>  python/ovs/fatal_signal.py | 24 +++++++++++++-----------
>  1 file changed, 13 insertions(+), 11 deletions(-)
> 
> diff --git a/python/ovs/fatal_signal.py b/python/ovs/fatal_signal.py
> index cb2e99e87..077f50dd5 100644
> --- a/python/ovs/fatal_signal.py
> +++ b/python/ovs/fatal_signal.py
> @@ -16,6 +16,7 @@ import atexit
>  import os
>  import signal
>  import sys
> +import threading
> 
>  import ovs.vlog
> 
> @@ -112,29 +113,29 @@ def _unlink(file_):
>  def _signal_handler(signr, _):
>      _call_hooks(signr)
> 
> -    # Re-raise the signal with the default handling so that the program
> -    # termination status reflects that we were killed by this signal.
> -    signal.signal(signr, signal.SIG_DFL)
> -    os.kill(os.getpid(), signr)
> -
> 
>  def _atexit_handler():
>      _call_hooks(0)
> 
> 
> -recurse = False
> +mutex = threading.Lock()
> 
> 
>  def _call_hooks(signr):
> -    global recurse
> -    if recurse:
> +    global mutex
> +    if not mutex.acquire(blocking=False):
>          return
> -    recurse = True
> 
>      for hook, cancel, run_at_exit in _hooks:
>          if signr != 0 or run_at_exit:
>              hook()
> 
> +    if signr != 0:
> +       # Re-raise the signal with the default handling so that the program
> +       # termination status reflects that we were killed by this signal.
> +       signal.signal(signr, signal.SIG_DFL)
> +       os.kill(os.getpid(), signr)
> +
> 
>  _inited = False
> 
> @@ -150,7 +151,9 @@ def _init():
>                         signal.SIGALRM]
> 
>          for signr in signals:
> -            if signal.getsignal(signr) == signal.SIG_DFL:
> +            handler = signal.getsignal(signr)
> +            if (handler == signal.SIG_DFL or
> +                handler == signal.default_int_handler):
>                  signal.signal(signr, _signal_handler)
>          atexit.register(_atexit_handler)
> 
> @@ -165,7 +168,6 @@ def signal_alarm(timeout):
> 
>      if sys.platform == "win32":
>          import time
> -        import threading
> 
>          class Alarm (threading.Thread):
>              def __init__(self, timeout):
> 
> 
> 
> Thanks for your review, Ilya Maximets.

I left a comment above that may need to be addressed.
But otherwise, this version seems mostly fine.  

Could you please send it as a new PATCH v2 with adjusted
patch name and the commit message?  It will get tested and
it will be easier to continue reviews this way.

Thanks!

Best regards, Ilya Maximets.
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Re: [ovs-dev] [PATCH v1] ovs-tcpdump: Cleanup mirror failed with twice fatal signals

Reply via email to