Re: [ast-developers] Realtime signal issues, revised, still loosing signals

Lionel Cons Tue, 04 Jun 2013 04:34:54 -0700

On 4 June 2013 02:24, Roland Mainz <[email protected]> wrote:
> On Mon, Jun 3, 2013 at 3:22 PM, Roland Mainz <[email protected]> wrote:
>> On Sat, Jun 1, 2013 at 4:17 AM, Cedric Blancher
>> <[email protected]> wrote:
>>> I've tried to use the realtime signals in ast-ksh 20130524 to see if
>>> they are reliable now, but ran into instabilities again.
> [snip]
>>>
>>> More than half of the signals are lost (which is a violation of the
>>> POSIX realtime spec which mandates that realtime signals must be
>>> reliable) and some of the array entries aren't even filled out (like
>>> rtar[14][0].msg is missing).
>>
>> Grumpf... I spent half the weekend digging and debugging the signal
>> handling code. Basically the issue is that signal+signal trap handling
>> in ast-ksh.2013-05-24 is IMO utterly *broken* - while testing I found
>> that SIGCHLD and SIGRTMIN signals are delivered _reliably_ to the
>> shell but the implementation then loses 8%-50% of them (measured on
>> SuSE 12.3/Linux/AMD64 and Solaris 11) during signal trap processing.
>>
>> Example:
>> I applied the following test patch to ast-ksh.2013-05-24:
>> -- snip --
> [snip]
>> -- snip --
>>
>>
>> Building the modified ksh93 version and running the script returns the
>> following debug output:
>> -- snip --
>> $ arch/linux.i386-64/bin/ksh ~/tmp/shrttest1.sh >/dev/null
>> #>>>>>>>>>>>>>>>>>>>> sum: numsigrt_received=30, numsigrt_processed=17
>> -- snip --
>>
>> The output is more or less self-describing: 30 SIGRTMIN signals have
>> been received but only 17 have been processed... which should not
>> happen. IMO the trap for RTMIN should be called 30 times - one time
>> for each SIGRTMIN signal received.
>>
>> The issue is not limited to RTMIN-RTMAX signal trap handling (for
>> example SIGCHLD handling has similar problems... which are basically
>> only avoided because each received SIGCHLD signal causes the shell to
>> probe all registered child processes) ... the whole shell signal trap
>> handling is IMO doomed (e.g. the same general issues apply to "bash"
>> and "dash") as long as...
>> 1. ... it happens on the signal stack itself
>> 2. ... C signal handlers execute shell code. The result is usually
>> silent data corruption within the shell. I call it "silent" since the
>> symptoms are usually not visible from the shell level unless a lot of
>> signals arrive and the shell scripts run long enough so the damage can
>> accumulate
>> 3. ... C signal handlers can poke around in every aspect of the
>> |Shell_t| structure.
>> 4. ... the code assumes that C statements like |sig &= ~SH_TRAP| are
>> an atomic operation (OK... wrong statement used as example) and can't
>> be interrupted.
>> On most RISC platforms this statement results in 3-5 instructions...
>> where another signal may interrupt between any of those 3-5
>> instructions. This kind of issue can currently strike _everywhere_ in
>> the code and leads to corruption issues when complex variable types
>> like indexed arrays, associative arrays or compound variables are
>> processed (Cedric's demo is a good example because multiple signal
>> handlers work on the same 2D indexed array).
>> 5. ... interleaving signal handlers can overwrite data in the .sh.sig
>> compound variable while other signal traps are running and reading
>> those data
>> 6. ... ksh93's C code uses stuff like "critical sections"
>> where signals are silently dropped
>>
>> I'm currently drafting a prototype patch[1] (which means: David...
>> please wait for my prototype patch before trying to address this
>> yourself) to fix these problems... but any feedback on the findings
>> above would be nice.
>>
>> [1]=IMO the only practical solution to fix all these problems is to
>> "offload" the execution of signal trap shell code into the main shell
>> (stack), e.g. the C signal handlers only "record" all (including
>> SIGCHLD) the signals (including siginfo data) in a queue and the shell
>> processes this queue in-order. IMO this will prevent all the issues
>> listed above, avoid data corruption, get rid of the critical sections
>> and the usual stack exhaustion crashes "bash" and "dash" are famous
>> for when too many signals arrive.
>>
>> BTW: As a side-effect, signal traps should only be called at shell
>> command level boundaries and _not_ while processing of a shell
>> command is underway.
>>
>> The "hard" part may be getting SIGCHLD handling in interactive shells
>> right... basically each place where we wait for child processes or
>> input needs to be followed with a check whether a new signal was
>> queued.
>
> Attached (as "ksh93_sig_queue_hack001.diff.txt") is a crude hack which
> shows that the shell can reliably process the SIGRTMIN signals (even
> if they arrive in a massive signal storm... on Solaris I unleashed a
> storm created by 1024 child processes coming from 128 CPUs without any
> issues... running stable for two hours... :-) ) ...
>
> The basic concept of the patch works like this:
> 1. |sh_fault()| (the signal handler function) is "reduced" to only
> allocate a chunk of memory, store the { signal number, siginfo data
> and the context pointer } in this chunk and queue it in a FIFO list


This sounds a lot like the signalfd() API. Have you looked at this concept?
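
For reference, a minimal sketch of what I mean, assuming Linux (signalfd() is
Linux-specific and not available on Solaris). The helper name
drain_rtsignals_via_signalfd() is made up for this mail and has nothing to do
with the ksh93 sources. It blocks SIGRTMIN, queues a few signals to the
process itself and reads them back as ordinary data from a file descriptor:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <sys/signalfd.h>
#include <unistd.h>

/* Hypothetical helper: queue `nsend` SIGRTMIN signals to ourselves and read
 * them back through a signalfd. Returns the number of signals that were read
 * in sending order, or -1 on error. SIGRTMIN stays blocked afterwards. */
int drain_rtsignals_via_signalfd(int nsend)
{
    assert(nsend >= 0);

    sigset_t mask;
    sigemptyset(&mask);
    sigaddset(&mask, SIGRTMIN);

    /* Block the signal so it stays pending instead of invoking a handler. */
    if (sigprocmask(SIG_BLOCK, &mask, NULL) == -1)
        return -1;

    int fd = signalfd(-1, &mask, SFD_CLOEXEC);
    if (fd == -1)
        return -1;

    /* Each queued signal carries its sequence number as payload. */
    for (int i = 0; i < nsend; i++) {
        union sigval v;
        v.sival_int = i;
        if (sigqueue(getpid(), SIGRTMIN, v) == -1) {
            close(fd);
            return -1;
        }
    }

    /* Read the pending signals one record at a time and check the order. */
    int in_order = 0;
    for (int i = 0; i < nsend; i++) {
        struct signalfd_siginfo si;
        if (read(fd, &si, sizeof si) != (ssize_t)sizeof si)
            break;
        if (si.ssi_signo == (uint32_t)SIGRTMIN && si.ssi_int == i)
            in_order++;
    }

    close(fd);
    return in_order;
}
```

No handler runs at all in this variant; the kernel keeps the queue and the
program picks the entries up from normal (non-signal) context whenever it
wants.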

> 2. |sh_chktrap()| is responsible for "draining" the queue and then
> processing the signals one-by-one in the order in which they were received
> (as Cedric found out the current implementation suffers from issues
> that a SIGRTMIN signal can be received _after_ the SIGCHLD signal from
> the same child process... ;-( )

Yeah, that was the big surprise we ran into a month ago when we
evaluated .sh.sig vs jobs -l output. It turned out that ksh93 calls
traps for signals from child processes which are dead for some
'ticks'. Correct ordering of signals *is* important, i.e. process them
in the order they are received to prevent logic errors in ksh
script-based applications.
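
To make sure I understand the record-then-drain idea, here is a rough
standalone sketch of it. This is not the patch and not ksh93 code; all names
(sigq_install(), sigq_drain(), sigq_demo() and so on) are hypothetical, and I
assume a fixed-size ring buffer instead of allocating memory in the handler,
because malloc() is not async-signal-safe. The handler only records the
signal and its siginfo; the "trap" is called later from normal context, in
arrival order:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

#define SIGQ_CAP 256  /* assumed fixed capacity; one slot always stays free */

struct sigq_entry { int signo; siginfo_t info; };

static struct sigq_entry sigq[SIGQ_CAP];
static volatile sig_atomic_t sigq_head;    /* next slot to read  */
static volatile sig_atomic_t sigq_tail;    /* next slot to write */
static volatile sig_atomic_t sigq_dropped; /* overflow counter   */

/* Signal handler: records the signal and returns. No trap code runs here. */
static void record_signal(int signo, siginfo_t *info, void *ctx)
{
    (void)ctx;
    int next = (sigq_tail + 1) % SIGQ_CAP;
    if (next == sigq_head) {   /* queue full: count the loss, never hide it */
        sigq_dropped++;
        return;
    }
    sigq[sigq_tail].signo = signo;
    sigq[sigq_tail].info = *info;
    sigq_tail = next;
}

int sigq_install(int signo)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = record_signal;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigfillset(&sa.sa_mask);   /* handlers never interrupt each other */
    return sigaction(signo, &sa, NULL);
}

/* Called from normal context: runs `trap` once per recorded signal, oldest
 * first, and returns how many entries were processed. */
int sigq_drain(void (*trap)(int signo, const siginfo_t *info))
{
    int n = 0;
    while (sigq_head != sigq_tail) {
        struct sigq_entry e = sigq[sigq_head];  /* copy out before release */
        sigq_head = (sigq_head + 1) % SIGQ_CAP;
        if (trap)
            trap(e.signo, &e.info);
        n++;
    }
    return n;
}

static int expect_next;  /* state for the demo trap below */
static int in_order;

static void demo_trap(int signo, const siginfo_t *info)
{
    if (signo == SIGRTMIN && info->si_value.sival_int == expect_next)
        in_order++;
    expect_next++;
}

/* Demo: send `nsend` SIGRTMIN signals to ourselves, then drain the queue.
 * Returns the number of traps that ran in sending order, or -1 on error. */
int sigq_demo(int nsend)
{
    assert(nsend >= 0 && nsend < SIGQ_CAP);

    sigset_t mask;
    sigemptyset(&mask);
    sigaddset(&mask, SIGRTMIN);
    if (sigprocmask(SIG_UNBLOCK, &mask, NULL) == -1)
        return -1;
    if (sigq_install(SIGRTMIN) == -1)
        return -1;

    expect_next = in_order = 0;
    for (int i = 0; i < nsend; i++) {
        union sigval v;
        v.sival_int = i;
        if (sigqueue(getpid(), SIGRTMIN, v) == -1)
            return -1;
    }
    sigq_drain(demo_trap);
    return in_order;
}
```

In a shell, the sigq_drain() call would sit wherever the main loop reaches a
command boundary, so the trap never runs on the signal stack. The fixed
capacity is the weak spot of this sketch: a large enough signal storm
overflows it, which is presumably why the patch allocates per-signal chunks
instead.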

>
> The result of this concept is that signals are processed reliably, can
> no longer "get lost" and that signal handling can be much easier, e.g.
> |sh_fault()| is much simpler (and no longer takes part in any changes
> in signal masks etc.), there is no longer any need to defer signals,
> have critical sections etc. and the special handling for SIGCHLD can
> be removed from the processing chain, too.
>
> Below is Cedric's modified test code... note that the debug output now
> shows that all signals are processed (regardless of how I try to break
> the code) and the compound variable array is correctly filled without
> issues or wrong/missing data:

Roland, thank you for the investigation and for trying to solve this problem.

Lionel
_______________________________________________
ast-developers mailing list
[email protected]
http://lists.research.att.com/mailman/listinfo/ast-developers
