On 4 June 2013 02:24, Roland Mainz <[email protected]> wrote:
> On Mon, Jun 3, 2013 at 3:22 PM, Roland Mainz <[email protected]> wrote:
>> On Sat, Jun 1, 2013 at 4:17 AM, Cedric Blancher
>> <[email protected]> wrote:
>>> I've tried to use the realtime signals in ast-ksh 20130524 to see if
>>> they are reliable now but ran into instabilities again.
> [snip]
>>>
>>> More than half of the signals are lost (which is a violation of the
>>> POSIX realtime spec which mandates that realtime signals must be
>>> reliable) and some of the array entries aren't even filled out (like
>>> rtar[14][0].msg is missing).
>>
>> Grumpf... I spent half the weekend digging and debugging the signal
>> handling code. Basically the issue is that signal+signal trap handling
>> in ast-ksh.2013-05-24 is IMO utterly *broken* - while testing I found
>> that SIGCHLD and SIGRTMIN signals are delivered _reliably_ to the
>> shell but the implementation then loses between 8%-50% (measured on
>> SuSE 12.3/Linux/AMD64 and Solaris 11) during signal trap processing.
>>
>> Example:
>> I applied the following test patch to ast-ksh.2013-05-24:
>> -- snip --
> [snip]
>> -- snip --
>>
>>
>> Building the modified ksh93 version and running the script returns the
>> following debug output:
>> -- snip --
>> $ arch/linux.i386-64/bin/ksh ~/tmp/shrttest1.sh >/dev/null
>> #>>>>>>>>>>>>>>>>>>>> sum: numsigrt_received=30, numsigrt_processed=17
>> -- snip --
>>
>> The output is more or less self-describing: 30 SIGRTMIN signals have
>> been received but only 17 have been processed... which should not
>> happen. IMO the trap for RTMIN should be called 30 times - one time
>> for each SIGRTMIN signal received.
>>
>> The issue is not limited to RTMIN-RTMAX signal trap handling (for
>> example SIGCHLD handling has similar problems... which are basically
>> only avoided because each received SIGCHLD signal causes the shell to
>> probe all registered child processes) ...
>> the whole shell signal trap handling is IMO doomed (e.g. the same
>> general issues apply to "bash" and "dash") as long as...
>> 1. ... it happens on the signal stack itself
>> 2. ... C signal handlers execute shell code. The result is usually
>> silent data corruption within the shell. I call it "silent" since the
>> symptoms are usually not visible from the shell level unless a lot of
>> signals arrive and the shell scripts run long enough so the damage can
>> accumulate
>> 3. ... C signal handlers can poke around in every aspect of the
>> |Shell_t| structure.
>> 4. ... the code assumes that C statements like |sig &= ~SH_TRAP| are
>> an atomic operation (OK... wrong statement used as example) and can't
>> be interrupted.
>> On most RISC platforms this statement results in 3-5 instructions...
>> where another signal may interrupt between any of those 3-5
>> instructions. This kind of issue can currently strike _everywhere_ in
>> the code and leads to corruption issues when complex variable types
>> like indexed arrays, associative arrays or compound variables are
>> processed (Cedric's demo is a good example because multiple signal
>> handlers work on the same 2D indexed array).
>> 5. ... interleaving signal handlers can overwrite data in the .sh.sig
>> compound variable while other signal traps are running and reading
>> those data
>> 6. ... ksh93's C code uses stuff like "critical sections"
>> where signals are silently dropped
>>
>> I'm currently drafting a prototype patch[1] (which means: David...
>> please wait for my prototype patch before trying to address this
>> yourself) to fix these problems... but any feedback on the findings
>> above would be nice.
>>
>> [1]=IMO the only practical solution to fix all these problems is to
>> "offload" the execution of signal trap shell code into the main shell
>> (stack), e.g.
>> the C signal handlers only "record" all signals (including SIGCHLD,
>> and including the siginfo data) in a queue and the shell
>> processes this queue in-order. IMO this will prevent all the issues
>> listed above, avoid data corruption, get rid of the critical sections
>> and the usual stack exhaustion crashes "bash" and "dash" are famous
>> for when too many signals arrive.
>>
>> BTW: As a side-effect signal traps should only be called at shell
>> command level boundaries and _not_ while processing of a shell
>> command is underway.
>>
>> The "hard" part may be getting SIGCHLD handling in interactive shells
>> right... basically each place where we wait for child processes or
>> input needs to be followed with a check whether a new signal was
>> queued.
>
> Attached (as "ksh93_sig_queue_hack001.diff.txt") is a crude hack which
> shows that the shell can reliably process the SIGRTMIN signals (even
> if they arrive in a massive signal storm... on Solaris I unleashed a
> storm created by 1024 child processes coming from 128 CPUs without any
> issues... running stable for two hours... :-) ) ...
>
> The basic concept of the patch works like this:
> 1. |sh_fault()| (the signal handler function) is "reduced" to only
> allocate a chunk of memory, store the { signal number, siginfo data
> and the context pointer } in this chunk and queue it in a FIFO list

This sounds a lot like the signalfd() API. Have you looked at this concept?

> 2. |sh_chktrap()| is responsible for "draining" the queue and then
> processing the signals one-by-one in the order in which they are received
> (as Cedric found out the current implementation suffers from issues
> that a SIGRTMIN signal can be received _after_ the SIGCHLD signal from
> the same child process... ;-( )

Yeah, that was the big surprise we ran into a month ago when we
evaluated .sh.sig vs jobs -l output. It turned out that ksh93 calls
traps for signals from child processes which have already been dead
for some 'ticks'. Correct ordering of signals *is* important, i.e.
process them in the order they are received to prevent logic errors in
ksh script-based applications.

>
> The result of this concept is that signals are processed reliably, can
> no longer "get lost" and that signal handling can be much easier, e.g.
> |sh_fault()| is much simpler (and no longer takes part in any changes
> in signal masks etc.), there is no longer any need to defer signals,
> have critical sections etc. and the special handling for SIGCHLD can
> be removed from the processing chain, too.
>
> Below is Cedric's modified test code... note that the debug output now
> shows that all signals are processed (regardless how I try to break
> the code) and the compound variable array is correctly filled without
> issues or wrong/missing data:

Roland, thank you for the investigation and for trying to solve this problem.

Lionel
_______________________________________________
ast-developers mailing list
[email protected]
http://lists.research.att.com/mailman/listinfo/ast-developers
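
For readers who want to try the "record first, process later, in order"
idea discussed in this thread, here is a minimal sketch. It is NOT the
ksh93 patch and does not use any ksh93 internals. Instead of a FIFO list
filled by the handler, it lets the kernel do the queueing: the realtime
signal is kept blocked and each queued instance is dequeued synchronously
with sigtimedwait(), which is close in spirit to the signalfd() approach
mentioned above. Assumptions: Linux, Python 3, and a pending-signal
resource limit larger than the number of signals sent. The function name
drain_queued_signals is made up for illustration.

```python
import signal
import threading


def drain_queued_signals(sig, count):
    """Queue `count` instances of realtime signal `sig` for the calling
    thread while it is blocked, then dequeue them one at a time.

    Hypothetical helper for illustration only (not ksh93 code).
    """
    # Block the signal so the kernel queues every instance instead of
    # interrupting the program asynchronously.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {sig})
    try:
        # Send the signal storm to this thread only.
        for _ in range(count):
            signal.pthread_kill(threading.get_ident(), sig)

        processed = []
        while True:
            # A timeout of 0 polls; None means nothing is pending any more.
            info = signal.sigtimedwait({sig}, 0)
            if info is None:
                break
            # This is the point where a shell would run the trap code,
            # on its normal stack, one signal after the other.
            processed.append(info.si_signo)
        return processed
    finally:
        # Restore the previous signal mask once the queue is empty.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)


if __name__ == "__main__":
    handled = drain_queued_signals(signal.SIGRTMIN, 30)
    print(f"sent=30 processed={len(handled)}")
```

The point of the sketch is only that nothing runs in asynchronous signal
context: every queued instance is seen exactly once and is handled at a
place the program chooses, which is what the queue in the patch is meant
to guarantee for shell traps.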
