Re[2]: wait skips signals but first one
Hello again,

The configure log says:

  checking if getcwd() will dynamically allocate memory with 0 size... (cached) yes
  checking for presence of POSIX-style sigsetjmp/siglongjmp... (cached) missing
  checking whether or not strcoll and strcmp differ... (cached) no

The sigsetjmp/siglongjmp result is most likely the problem.

Note 1: I forgot to mention that I'm cross-compiling.

Note 2: it probably makes sense to add a warning stating that HAVE_POSIX_SIGSETJMP is disabled because of cross-compiling.

I will try to find a way to fix this.

Thank you for your time! You are doing a great job!

M

February 5, 2024, 16:28:36, from "Chet Ramey":

On 2/3/24 7:01 PM, Mykyta Dorokhin wrote:

> There is a line in trap.c with your change. If I revert it then everything
> works again:
>
> -  if (interrupt_immediately && wait_intr_flag)
> +  if (/* interrupt_immediately && */wait_intr_flag)
>
> So if I put interrupt_immediately back and rebuild the code with this only
> fix then it starts working properly, signals are getting received as expected.

OK. Let's look at that. By this time, interrupt_immediately was no longer set anywhere, so the code before this change did nothing but inhibit the siglongjmp/longjmp call from trap_handler, which means the sighandler returned and (possibly) did not interrupt the wait builtin.

That is what this means (replace SIGINT with SIGUSR1 here):

> The one change that might make a difference is a bug fix: if the wait
> builtin is waiting for a process and receives a trapped signal, it's
> supposed to cause wait to return immediately and then run the trap. Bash
> didn't do that consistently for SIGINT, and would run the trap when it
> shouldn't, or before it should, and sometimes not return from the wait
> at all. So maybe the longjmp back to the wait builtin is what changed
> things, even though longjmp is one of the functions that POSIX says is
> safe to call from a signal handler context, and it restores the signal
> mask if you're running on a system that has sigsetjmp/siglongjmp.

So the effect of this change is to longjmp/siglongjmp back to the wait builtin, so it can return from there before running the trap. If you use siglongjmp, it restores the original signal mask (look at the wait builtin's call to setjmp_sigs, a macro that calls sigsetjmp with 1 as the second argument), which means the trapped signal is no longer blocked.

Since this works as intended on all other systems, I would check to see if your system has sigsetjmp/siglongjmp and whether or not they are behaving correctly.

-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/
Re[2]: wait skips signals but first one
Hello,

Again, I'm on the "commit bash-20200221 snapshot" commit, the one I think breaks things:

https://git.savannah.gnu.org/cgit/bash.git/commit/?h=devel&id=0df4ddca3f371bc258fe4185cdec36fce3e7be7b

There is a line in trap.c with your change. If I revert it then everything works again:

-  if (interrupt_immediately && wait_intr_flag)
+  if (/* interrupt_immediately && */wait_intr_flag)

So if I put interrupt_immediately back and rebuild the code with only this fix, it starts working properly and signals are received as expected.

Can you comment? Maybe you want me to provide some additional debug info?

Thank you,
Mykyta

February 3, 2024, 22:09:33, from "Chet Ramey":

On 2/3/24 10:00 AM, Mykyta Dorokhin wrote:

> I have found the commit on devel branch which breaks things for me (and
> probably other Yocto-based builds):
>
> This one still works
> ==
>
> commit 89d788fb0152724a93e0fdab8c15116e5c76572b
> Author: Chet Ramey
> Date: Mon Feb 17 11:41:35 2020 -0500
>
> commit bash-20200214 snapshot
>
> This one not
> ==
>
> commit 0df4ddca3f371bc258fe4185cdec36fce3e7be7b
> Author: Chet Ramey
> Date: Mon Feb 24 10:41:37 2020 -0500
>
> commit bash-20200221 snapshot
>
> Please take a look. Maybe you'll notice something suspicious there. I don't
> know... uninitialized variables, endian-dependent code, etc.

There are changes there, of course, but it's hard to see how they make a difference.

The wait builtin was changed not to interrupt the wait for a trapped SIGCHLD, but to delay running any SIGCHLD trap until the wait exited. Since your example doesn't trap SIGCHLD, it doesn't seem significant. Any other trapped signal still interrupts the wait.

Subshells clear the process substitution FIFO list, but you're not using process substitution.

The one change that might make a difference is a bug fix: if the wait builtin is waiting for a process and receives a trapped signal, it's supposed to cause wait to return immediately and then run the trap. Bash didn't do that consistently for SIGINT, and would run the trap when it shouldn't, or before it should, and sometimes not return from the wait at all. So maybe the longjmp back to the wait builtin is what changed things, even though longjmp is one of the functions that POSIX says is safe to call from a signal handler context, and it restores the signal mask if you're running on a system that has sigsetjmp/siglongjmp.

-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/
Re[2]: wait skips signals but first one
Hello again,

Here is another analysis that my colleague made on the issue:

Bash Compiled for wrong OS? Analysis with strace.

After receiving SIGUSR1, Debian only blocks SIGCHLD, then clears the block:

  205295 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=205327, si_uid=1040} ---
  205295 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
  205295 rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f063bdb5fd0}, {sa_handler=0x5637247940b0, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f063bdb5fd0}, 8) = 0
  205295 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
  205295 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0    # unblocks all signals

The above is the correct action. On our device, it blocks SIGUSR1 as well as SIGCHLD and keeps doing it over and over again:

  6707 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=6724, si_uid=0} ---
  6707 rt_sigprocmask(SIG_BLOCK, [CHLD], [USR1 CHLD], 8) = 0
  6707 rt_sigprocmask(SIG_SETMASK, [USR1 CHLD], NULL, 8) = 0
  6707 rt_sigprocmask(SIG_BLOCK, NULL, [USR1 CHLD], 8) = 0
  6707 write(1, ">>> TRAPPED USR1 <<<\n", 21) = 21
  6707 rt_sigprocmask(SIG_BLOCK, [CHLD], [USR1 CHLD], 8) = 0
  6707 rt_sigprocmask(SIG_SETMASK, [USR1 CHLD], NULL, 8) = 0
  6707 rt_sigprocmask(SIG_BLOCK, [CHLD], [USR1 CHLD], 8) = 0
  6707 rt_sigprocmask(SIG_SETMASK, [USR1 CHLD], NULL, 8) = 0
  6707 write(1, "Iteration\n", 10) = 10
  6707 rt_sigprocmask(SIG_BLOCK, NULL, [USR1 CHLD], 8) = 0
  6707 rt_sigprocmask(SIG_BLOCK, [INT TERM CHLD], [USR1 CHLD], 8) = 0
  6707 clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x76fe9028) = 6725
  6707 rt_sigprocmask(SIG_SETMASK, [USR1 CHLD], NULL, 8) = 0
  6707 rt_sigprocmask(SIG_BLOCK, [CHLD], [USR1 CHLD], 8) = 0
  6707 rt_sigprocmask(SIG_BLOCK, [CHLD], [USR1 CHLD], 8) = 0
  6707 rt_sigprocmask(SIG_SETMASK, [USR1 CHLD], NULL, 8) = 0
  6707 rt_sigprocmask(SIG_BLOCK, [CHLD], [USR1 CHLD], 8) = 0
  6707 rt_sigprocmask(SIG_SETMASK, [USR1 CHLD], NULL, 8) = 0
  6707 rt_sigprocmask(SIG_BLOCK, [CHLD], [USR1 CHLD], 8) = 0
  6707 rt_sigprocmask(SIG_SETMASK, [USR1 CHLD], NULL, 8) = 0
  6707 rt_sigprocmask(SIG_SETMASK, [USR1 CHLD], NULL, 8) = 0
  6707 rt_sigprocmask(SIG_BLOCK, [CHLD], [USR1 CHLD], 8) = 0
  6707 rt_sigprocmask(SIG_SETMASK, [USR1 CHLD], NULL, 8) = 0
  6707 rt_sigprocmask(SIG_BLOCK, [CHLD], [USR1 CHLD], 8) = 0
  6707 rt_sigaction(SIGINT, {sa_handler=0x46e15, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x76e90711},
  6707 <... rt_sigaction resumed>{sa_handler=0x46e15, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x76e90711}, 8) = 0
  6707 wait4(-1,

On modern systems, the OS blocks the signal being handled while its handler runs, so that signal handlers are not called recursively, and unblocks it again when the handler returns. The exception to this is if SA_NODEFER is set. On some very old UNIX systems you had to block the signal yourself, and there was a small window where things could go wrong.

I suspect Bash has a build option to allow blocking signals in handlers for compatibility with other systems, and that it is not being built correctly for Linux. I suspect that on those very old systems the signal was automatically unblocked on return, but that this does not happen here because the POSIX sigprocmask is called, which on Linux requires a second call to unblock the signal. And since wait is restarted, the signal is never unblocked.

According to strace, no additional user flags are set when the Bash signal handler is put in place for SIGUSR1.

We need to look at the Bash build options, and possibly the signal handling code and sigprocmask, or whatever C API they are using to call sigprocmask().
Re[2]: wait skips signals but first one
> Like you, I can't reproduce it on the desktop platforms I have available
> right now.
>
> The bash devel git branch has fairly fine granularity. If you can automate
> the signal sending somewhat, maybe by having a child process send signals
> to $$, you could use your script and `git bisect' to find the commit where
> the behavior changed. bash-5.0 was frozen 12/31/2018, and bash-5.1 was
> frozen 12/14/2020, so that should get you started with the devel branch
> commits you want to inspect.
>
> http://git.savannah.gnu.org/cgit/bash.git/log/?h=devel

I have found the commit on devel branch which breaks things for me (and probably other Yocto-based builds):

This one still works
==

commit 89d788fb0152724a93e0fdab8c15116e5c76572b
Author: Chet Ramey
Date: Mon Feb 17 11:41:35 2020 -0500

commit bash-20200214 snapshot

This one not
==

commit 0df4ddca3f371bc258fe4185cdec36fce3e7be7b
Author: Chet Ramey
Date: Mon Feb 24 10:41:37 2020 -0500

commit bash-20200221 snapshot

Please take a look. Maybe you'll notice something suspicious there. I don't know... uninitialized variables, endian-dependent code, etc.

Thank you,
Mykyta