On Mon, Jun 3, 2013 at 3:22 PM, Roland Mainz <[email protected]> wrote:
> On Sat, Jun 1, 2013 at 4:17 AM, Cedric Blancher
> <[email protected]> wrote:
>> I've tried to use the realtime signals in ast-ksh 20130524 to see if
>> they are reliable how but ran into instabilities again.
[snip]
>>
>> More than half of the signals are lost (which is a violation of the
>> POSIX realtime spec which mandates that realtime signals must be
>> reliable) and some of the array entries aren't even filled out (like
>> rtar[14][0].msg is missing).
>
> Grumpf... I spend half the weekend digging and debugging the signal
> handling code. Basically the issue is that signal+signal trap handling
> in ast-ksh.2013-05-24 is IMO utterly *broken* - while testing I found
> that SIGCHLD and SIGRTMIN signals are delivered _reliably_ to the
> shell but the implementation then looses between 8%-50% (measured on
> SuSE 12.3/Linux/AMD64 and Solaris 11) during signal trap processing.
>
> Example:
> I applied the following test patch to ast-ksh.2013-05-24:
> -- snip --
[snip]
> -- snip --
>
>
> Building the modified ksh93 version and running the script returns the
> following debug output:
> -- snip --
> $ arch/linux.i386-64/bin/ksh ~/tmp/shrttest1.sh >/dev/null
> #>>>>>>>>>>>>>>>>>>>> sum: numsigrt_received=30, numsigrt_processed=17
> -- snip --
>
> The output is more or less self-describing: 30 SIGRTMIN signals have
> been received but only 17 have been processed... which should not
> happen. IMO the trap for RTMIN should be called 30 times - one time
> for each SIGRTMIN signal received.
>
> The issue is not limited to RTMIN-RTMAX signal trap handling (for
> example SIGCHLD handling has similar problems... which are basically
> only avoided because each received SIGCHLD signal causes the shell to
> probe all registered child processes) ... the whole shell signal trap
> handling is IMO doomed (e.g. the same general issues apply to "bash"
> and "dash") as long as...
> 1. ... it happens on the signal stack itself
> 2. ... C signal handlers execute shell code. The result is usually
> silent data corruption within the shell. I call it "silent" since the
> symptoms are usually not visible from the shell level unless a lot of
> signals arrive and the shell scripts run long enough so the damage can
> accumulate
> 3. ... C signal handlers can poke around in every aspect of the
> |Shell_t| structure.
> 4. ... the code assumes that C statements like |sig &= ~SH_TRAP| are
> an atomic operation (OK... wrong statement used as example) and can't
> be interrupted.
> On most RISC platforms this statement results in 3-5 instructions...
> where another signal may interrupt between any of those 3-5
> instructions. This kind of issue can currently strike _everywhere_ in
> the code and leads to corruption issues when complex variables types
> like indexed arrays, associative arrays or compound variables are
> processed (Cedric's demo is a good example because multiple signal
> handlers work on the same 2D indexed array).
> 5. ... interleaving signal handlers can overwrite data in the .sh.sig
> compound variable while other signal traps are running and reading
> those data
> 6. ... long as ksh93's C code uses stuff like "critical sections"
> where signals are silently dropped
>
> I'm currently drafting a prototype patch[1] (which means: David...
> please wait for my prototype patch before trying to address this
> yourself) to fix these problems... but any feedback on the findings
> above would be nice.
>
> [1]=IMO the only practical solution to fix all these problems is to
> "offload" the execution of signal trap shell code into the main shell
> (stack), e.g. the C signal handlers only "record" all (including
> SIGCHLD) the signals (including siginfo data) in a queue and the shell
> processes this queue in-order. IMO this will prevent all the issues
> listed above, avoid data corruption, get rid of the critical sections
> and the usual stack exhaustion crashes "bash" and "dash" are famous
> for when too many signals arrive.
>
> BTW: As side-effect signal traps should only be called at shell
> command level boundaries and _not_ during processing of a shell
> command is underway.
>
> The "hard" part may be getting SIGCHLD handling in interactive shells
> right... basically each place where we wait for child processes or
> input needs to be followed with a check whether a new signal was
> queued.
Attached (as "ksh93_sig_queue_hack001.diff.txt") is a crude hack which
shows that the shell can reliably process the SIGRTMIN signals (even
if they arrive in a massive signal storm... on Solaris I unleashed a
storm created by 1024 child processes coming from 128 CPUs without any
issues... running stable for two hours... :-) ) ...
The basic concept of the patch works like this:
1. |sh_fault()| (the signal handler function) is "reduced" to only
allocate a chunk of memory, store the { signal number, siginfo data
and the context pointer } in this chunk and queue it in a FIFO list
2. |sh_chktrap()| is responsible for "draining" the queue and then
process the signals one-by-one in the order in which they are received
(as Cedric found out the current implementation suffers from issues
that a SIGRTMIN signal can be received _after_ the SIGCHLD signal from
the same child process... ;-( )
The result of this concept is that signals are processed reliably, can
no longer "get lost" and that signal handling can be much easier, e.g.
|sh_fault()| is much simpler (and no longer takes part in any changes
in signal masks etc.), there is no longer any need to defer signals,
have critical sections etc. and the special handling for SIGCHLD can
be removed from the processing chain, too.
Below is Cedric's modified test code... note that the debug output now
shows that all signals are processed (regardless how I try to break
the code) and the compound variable array is correctly filled without
issues or wrong/missing data:
-- snip --
$ ./arch/linux.i386-64/bin/ksh ~/tmp/shrttest1.sh
#DONE.
#>>>>>>>>>>>>>>>>>>>> sum: numsigrt_received=3072, numsigrt_processed=3072
$ cat ~/tmp/shrttest1.sh
compound -a rtar
function rttrap
{
integer v=${.sh.sig.value}
integer s=${#rtar[v][@]}
rtar[$v][$s]=(
integer pid=${.sh.sig.pid}
typeset msg=${v}
)
}
trap 'rttrap' RTMIN
typeset m # used in child processes
integer pid p i
# note that "numchildren" can't be higher than 4*$(ulimit -i)
integer -r numchildren=128
(( pid=$$ ))
for (( p=0 ; p < numchildren ; p++ )) ; do
{
# sleep for a moment to make sure the
# other child processes can become ready
# (sort of "poor man's" "barrier")
sleep 2
for m in 'a' 'b' 'c' 'd' 'e' 'f' ; do
kill -q $((16#$m)) -s RTMIN $pid
kill -q $((16#$m)) -s RTMIN $pid
kill -q $((16#$m)) -s RTMIN $pid
kill -q $((16#$m)) -s RTMIN $pid
done
sleep 5 # make sure SIGCHLD doesn't interfere
exit 0
} &
done
# wait for children to terminate
# we need to loop here since wait(1) may be aborted
# by signals
while ! wait ; do
true
done
# hack: force signal queue processing
true | read dummy
#
# validate results...
#
bool fail=false
if ! (( ${#rtar[@]} == 6 )) ; then
printf "got {#rtar[@]}=%d, expected 6\n" ${#rtar[@]}
(( fail=true ))
fi
for (( i=0xa ; i <= 0xf; i++ )) ; do
if ! (( ${#rtar[$i][*]} == (numchildren*4) )) ; then
printf "got {#rtar[%d][*]}=%d," \
$i ${#rtar[$i][*]}
printf "expected %d\n" $((numchildren*4))
(( fail=true ))
fi
done
((fail)) && print -v rtar
print '#DONE.'
-- snip --
----
Bye,
Roland
--
__ . . __
(o.\ \/ /.o) [email protected]
\__\/\/__/ MPEG specialist, C&&JAVA&&Sun&&Unix programmer
/O /==\ O\ TEL +49 641 3992797
(;O/ \/ \O;)
diff -r -u src/cmd/ksh93/sh/fault.c src/cmd/ksh93/sh/fault.c
--- src/cmd/ksh93/sh/fault.c 2013-05-22 19:33:13.000000000 +0200
+++ src/cmd/ksh93/sh/fault.c 2013-06-03 23:21:35.782217127 +0200
@@ -72,13 +72,14 @@
return(action);
}
-/*
- * Most signals caught or ignored by the shell come here
-*/
+volatile int numsigrt_received=0;
+volatile int numsigrt_processed=0;
+
+
#ifdef _lib_sigaction
-void sh_fault(register int sig, siginfo_t *info, void *context)
+void sh_fault2(register int sig, siginfo_t *info, void *context)
#else
-void sh_fault(register int sig)
+void sh_fault2(register int sig)
#endif
{
register Shell_t *shp = sh_getinterp();
@@ -86,10 +87,9 @@
register char *trap;
register struct checkpt *pp = (struct checkpt*)shp->jmplist;
int action=0;
+
/* reset handler */
- if(!(sig&SH_TRAP))
- signal(sig, sh_fault);
- sig &= ~SH_TRAP;
+
#ifdef SIGWINCH
if(sig==SIGWINCH)
{
@@ -417,14 +417,13 @@
* check for traps
*/
-void sh_chktrap(Shell_t* shp)
+void sh_chktrap2(Shell_t* shp)
{
register int sig=shp->st.trapmax;
register char *trap;
if(!(shp->trapnote&~SH_SIGIGNORE))
sig=0;
- if(sh.intrap)
- return;
+
shp->trapnote &= ~SH_SIGTRAP;
/* execute errexit trap first */
if(sh_isstate(shp,SH_ERREXIT) && shp->exitval)
@@ -446,14 +445,11 @@
sh_exit(shp,shp->exitval);
}
}
- if(!shp->sigflag)
- return;
+
if(shp->sigflag[SIGALRM]&SH_SIGALRM)
sh_timetraps(shp);
while(--sig>=0)
{
- if(sig==cursig)
- continue;
if(shp->sigflag[sig]&SH_SIGTRAP)
{
shp->sigflag[sig] &= ~SH_SIGTRAP;
@@ -469,6 +465,11 @@
sh_setsiginfo((siginfo_t*)shp->siginfo[sig]);
#endif
cursig = sig;
+
+if (sig==SIGRTMIN)
+{
+ numsigrt_processed++;
+}
sh_trap(shp,trap,0);
#ifdef _lib_sigaction
if(shp->siginfo[sig] && sig!=SIGCHLD)
@@ -482,6 +483,77 @@
}
+typedef struct _siginfo_chain_t siginfo_chain_t;
+struct _siginfo_chain_t
+{
+ siginfo_chain_t *next;
+ int sig;
+ siginfo_t si;
+ void *context;
+};
+
+volatile siginfo_chain_t *sic_queue=NULL;
+
+/*
+ * Most signals caught or ignored by the shell come here
+*/
+void sh_fault(register int sig, siginfo_t *info, void *context)
+{
+ siginfo_chain_t *si;
+
+ if (sig==SIGRTMIN)
+ {
+ numsigrt_received++;
+ }
+
+ si = calloc(sizeof(siginfo_chain_t), 1);
+ si->next = NULL;
+ si->sig = sig;
+ si->context = context;
+ memcpy(&si->si,info,sizeof(siginfo_t));
+
+ /*
+ * queue signal
+ * WARNING: this version is not async-signal-safe!!
+ */
+ if (sic_queue)
+ {
+ siginfo_chain_t *chain = (siginfo_chain_t *)sic_queue;
+ while (chain->next != NULL)
+ chain=chain->next;
+ chain->next = si;
+ }
+ else
+ sic_queue = si;
+}
+
+void sh_chktrap(Shell_t* shp)
+{
+ siginfo_chain_t *si;
+ siginfo_chain_t *chain;
+
+ /*
+ * grab queue for processing
+ * WARNING: This version is not async-signal-safe
+ */
+ si=sic_queue;
+ sic_queue=NULL;
+
+ /*
+ * Process queue and call |sh_fault2()| and |sh_chktrap2()|
+ * for each signal in sequence
+ */
+ while (si != NULL)
+ {
+ sh_fault2(si->sig, &si->si, si->context);
+ sh_chktrap2(shp);
+
+ chain=si->next;
+ free(si);
+ si=chain;
+ }
+}
+
/*
* exit the current scope and jump to an earlier one based on pp->mode
*/
diff -r -u src/cmd/ksh93/sh/main.c src/cmd/ksh93/sh/main.c
--- src/cmd/ksh93/sh/main.c 2013-05-18 18:09:01.000000000 +0200
+++ src/cmd/ksh93/sh/main.c 2013-06-03 12:42:35.020704526 +0200
@@ -369,6 +369,12 @@
sh_onstate(shp,SH_INTERACTIVE);
nv_putval(IFSNOD,(char*)e_sptbnl,NV_RDONLY);
exfile(shp,iop,fdin);
+
+extern volatile int numsigrt_received;
+extern volatile int numsigrt_processed;
+sfprintf(sfstderr, "#>>>>>>>>>>>>>>>>>>>> sum: numsigrt_received=%d,
numsigrt_processed=%d\n", numsigrt_received, numsigrt_processed);
+
+
sh_done(shp,0);
/* NOTREACHED */
return(0);
_______________________________________________
ast-developers mailing list
[email protected]
http://lists.research.att.com/mailman/listinfo/ast-developers