Re: [ovs-dev] [PATCH v1] ovs-tcpdump: Cleanup mirror failed with twice fatal signals

Ilya Maximets Thu, 22 Feb 2024 02:53:01 -0800

On 2/22/24 11:02, Daniel Ding wrote:
> 
> 
>> 2024年2月22日 上午1:55，Ilya Maximets <[email protected] 
>> <mailto:[email protected]>> 写道：
>>
>> On 2/21/24 15:27, Aaron Conole wrote:
>>> Daniel Ding <[email protected] <mailto:[email protected]>> writes:
>>>
>>>> After running ovs-tcpdump and inputs multiple CTRL+C, the program will
>>>> raise the following exception.
>>>>
>>>> Error in atexit._run_exitfuncs:
>>>> Traceback (most recent call last):
>>>>  File "/usr/bin/ovs-tcpdump", line 421, in cleanup_mirror
>>>>    ovsdb = OVSDB(db_sock)
>>>>  File "/usr/bin/ovs-tcpdump", line 168, in __init__
>>>>    OVSDB.wait_for_db_change(self._idl_conn)  # Initial Sync with DB
>>>>  File "/usr/bin/ovs-tcpdump", line 155, in wait_for_db_change
>>>>    while idl.change_seqno == seq and not idl.run():
>>>>
>>>> Signed-off-by: Daniel Ding <[email protected] 
>>>> <mailto:[email protected]>>
>>>> ---
>>>
>>> LGTM for the linux side - maybe Alin might check the windows side.
>>
>> The code is copied from the fatal-signla library, so it is likely fine.
>> However, I don;t think this issue should be fixed on the application
>> level (ovs-tcpdump).   There seems to be adifference in how C and python
>> fatal-signla libraries behave.  The C version calls the hooks in the run()
>> function and the signal handler only updates the signal number variable.
>> So, if the same signal arrives multiple times it will be handled only once
>> and the run() function will not exit until all the hooks are done, unless
>> it is a segfault.
>>
>> With python implementation we're calling hooks right from the signal
>> handler (it's not a real signal handler, so we're allowed to use not
>> really singal-safe functions there).  So, if another signal arrives while
>> we're executing the hooks, the second hook execution will be skipped
>> due to 'recursion' check, but the signal will be re-raised immediately,
>> killing the process.
> 
> As your suggestions, I rewrite recurse as a threading.Lock. So that protect 
> calling hooks
> applications registered.
> 
>>
>> There is likley a way to fix that by using a mutex instead of recursion
>> check and blocking it while executing the hooks.  The signal number
>> will need to be stored in the signal handler and the signal will need
>> to be re-raised in the call_hooks() instead of signal handler.
>> We can use mutex.acquire(blocking=False) as a way to prevent recursion.
>>
> 
> To avoid breaking calling hooks, I set the signal restored in the signal 
> handler to SIG_IGN in 
> call_hooks.


I'm not sure this is necessary, see below.

> And move re-raised the signal in the call_hooks with locked.
> 
>> Does that make sense?
>>
>> Best regards, Ilya Maximets.
> 
> diff --git a/python/ovs/fatal_signal.py b/python/ovs/fatal_signal.py
> index cb2e99e87..3aa56a166 100644
> --- a/python/ovs/fatal_signal.py
> +++ b/python/ovs/fatal_signal.py
> @@ -16,6 +16,7 @@ import atexit
>  import os
>  import signal
>  import sys
> +import threading
> 
>  import ovs.vlog
> 
> @@ -112,28 +113,34 @@ def _unlink(file_):
>  def _signal_handler(signr, _):
>      _call_hooks(signr)
> 
> -    # Re-raise the signal with the default handling so that the program
> -    # termination status reflects that we were killed by this signal.
> -    signal.signal(signr, signal.SIG_DFL)
> -    os.kill(os.getpid(), signr)
> -
> 
>  def _atexit_handler():
>      _call_hooks(0)
> 
> 
> -recurse = False
> +recurse = threading.Lock()

It's better to rename this to just 'mutex' or 'lock'.

> 
> 
>  def _call_hooks(signr):
>      global recurse
> -    if recurse:
> +    if recurse.locked():
>          return

There is an awkward race window between checking that the mutex is
locked and actually attempting to take it below.  It's probably not
a big deal here, but it's better to take the lock at the same time
with checking, e.g.:

    if not mutex.acquire(blocking=False):
        return

And since we're going to end the process anyway (it's a handler
for fatal signals after all), we don't actually need to unlock it,
AFAICT.

> -    recurse = True
> 
> -    for hook, cancel, run_at_exit in _hooks:
> -        if signr != 0 or run_at_exit:
> -            hook()
> +    with recurse:
> +        if signr != 0:
> +            signal.signal(signr, signal.SIG_IGN)
> +        else:
> +            signal.signal(signal.SIGINT, signal.SIG_IGN)

If another signal arrives while we're under the lock, the signal
handler will call the _call_hooks(), check that the lock is already
taken and return.  There is no need to ignore it.
Also, if a different signal arrives the code above will not ignore
it anyway.

> +
> +        for hook, cancel, run_at_exit in _hooks:
> +            if signr != 0 or run_at_exit:
> +                hook()
> +
> +        if signr != 0:
> +           # Re-raise the signal with the default handling so that the 
> program
> +           # termination status reflects that we were killed by this signal.
> +           signal.signal(signr, signal.SIG_DFL)
> +           os.kill(os.getpid(), signr)

We may want to place this in the try: finally: block, so we still
re-raise the signal even if one of the hooks throws an exception.
Hooks should not generally throw exceptions, but may be safer this
way.

What do you think?

> 
> 
>  _inited = False
> @@ -165,7 +172,6 @@ def signal_alarm(timeout):
> 
>      if sys.platform == "win32":
>          import time
> -        import threading
> 
>          class Alarm (threading.Thread):
>              def __init__(self, timeout):
> --
> 2.43.0
> 
> Thanks Ilya Maximets, please help me review again.
> 
> Best regards, Daniel Ding.
> 

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Re: [ovs-dev] [PATCH v1] ovs-tcpdump: Cleanup mirror failed with twice fatal signals

Reply via email to