Re: [ast-developers] Hang because job management _within_ the SIGCHLD signal handler... / was: Re: AT&T Software Technology ast alpha software download update

Irek Szczesniak Tue, 30 Jul 2013 08:14:38 -0700

On Tue, Jul 30, 2013 at 4:49 PM, Glenn Fowler <[email protected]> wrote:
>
> On Tue, 30 Jul 2013 16:16:28 +0200 Roland Mainz wrote:
>> On Tue, Jul 30, 2013 at 4:12 PM, Glenn Fowler <[email protected]> wrote:
>> >
>> > On Tue, 30 Jul 2013 15:40:48 +0200 Roland Mainz wrote:
>> >> --047d7b6d8d7cb1934204e2bac182
>> >> Content-Type: text/plain; charset=ISO-8859-1
>> >
>> >> On Mon, Jul 29, 2013 at 3:34 PM, Roland Mainz <[email protected]> 
>> >> wrote:
>> >> > On Sat, Jul 27, 2013 at 6:17 PM, Glenn Fowler <[email protected]> 
>> >> > wrote:
>> >> [snip]
>> >> > I hit (again) one of the nasty and hard to reproduce problems (which
>> >> > means: No... I don't have a testcase... ;-( ) related to job
>> >> > management:
>> >> > -- snip --
>> >> [snip]
>> >> > -- snip --
>> >> >
>> >> > ... this hang occured because ksh93 tries to do job management from
>> >> > within the signal handler... which is IMO _far_ from being safe to
>> >> > do...
>> >> >
>> >> > David: Is there *ANY* way the job management can be moved out of the
>> >> > signal handler and just rely on the siginfo data ?
>> >
>> >> Attached (as 
>> >> "astksh20130727_prototype_process_sigchld_outside_signalhandler001.diff.txt")
>> >> is a prototype (hacked) patch for discussion.
>> >
>> >> * The good news are:
>> >> - SIGCHLD processing now seems to be reliable. I can't find a simple
>> >> testcase to break it (but see the item about the testsuite below)
>> >> - SIGCHLD processing now works without problems under "valgrind" control
>> >> - The patch seems to fix a lot of Cedric's issues with signals being
>> >> lost. Even the testcase where a $ ... while ! wait ; sleep 8 ; done #
>> >> was loosing signals now works... if I replace the "sleep 8" with
>> >> "compound -a dummy ; poll -t 8 dummy" (so there seems to be a seperate
>> >> problem with "sleep" ... somehow)
>> >
>> >> * The bad news are:
>> >> - The patch doesn't work very well for interactive shells
>> >> - The "sigchld.sh" testsuite module fails in several places... AFAIK
>> >> the reason is that we don't check the list of queued CHLD signals in
>> >> all places where they should be checked (e.g. each time before we
>> >> execute a shell statement and before and after running external
>> >> processes and each time wait(1) internally wakes up from a signal)
>> >> - $ jobs -l # reports already purged processes... erm... no clue why... 
>> >> ;-(
>> >
>> >> * Notes about interactive mode:
>> >> An interactive shell AFAIK requires to report terminating childs
>> >> "immediately" if some shell options are set (erm... David: Which
>> >> option is this again ?). AFAIK this can be archived by two ways:
>> >> 1. wake up from time-to-time (e.g. polling) ... but this has a bad
>> >> effect in CPU usage (imagine 1000 shell sessions in a multi-user
>> >> system waking-up 10 times per second... or imagine the impact for
>> >> battery life in a laptop)
>> >> 2. add some glue to the editor code to wake up for each signald and
>> >> process it when |sfpkrd()| returns...
>> >> -- snip --
>> >> (gdb) where
>> >> #0  0x00007fc2be588860 in __poll_nocancel () at
>> >> ../sysdeps/unix/syscall-template.S:81
>> >> #1  0x000000000050280d in sfpkrd (fd=0, argbuf=0x7fff33545080, n=80,
>> >> rc=0, tm=-1, action=1)
>> >>     at 
>> >> /home/test001/work/ast_ksh_20130727/build_i386_64bit_debug_patched/src/lib/libast/sfio/sfpkrd.c:142
>> >> #2  0x0000000000410fb9 in ed_read (context=0x7fc2bf063960, fd=0,
>> >> buff=0x7fff33545080 "\240PT3\377\177", size=80, reedit=0)
>> >>     at 
>> >> /home/test001/work/ast_ksh_20130727/build_i386_64bit_debug_patched/src/cmd/ksh93/edit/edit.c:884
>> >> -- snip --
>> >> ... that's the strategy my patch attempts...
>> >
>> > for the current code do you see failures when the vmalloc debug method is 
>> > not used?
>
>> Sporadically... yes. But I'm not sure whether this is an issue in
>> vmalloc or something else. I'm not even sure whether the job lkist
>> handling code in ksh93 is the root cause of most of the trouble. This
>> is all made much harder because Linux doesn't have much usefull tools
>> to instrument signal handlers so I have to walk the code by hand.
>
> it might be good to nail down this problem first
> otherwise we might be applying fixes for something that's not broken


Chime in on this topic. sh_fault() is one piece of brokenness from head to tail.
Just from looking at it for five minutes I can tell you:
- SIGWINCH handling is broken. You can't seriously assume that
nv_putval() is save to use in a signal handler. Practical testing
shows that if you run unset LINES COLUMNS and recreate the variables
in a loop and resize the window (just drag the right bottom border
around in Gnome or KDE) the storm of WINCH signals will trigger a SEGV
sooner or later in a random location of shell code
- A lot of variables like shp->st.trapcom[sig] are accessed from
inside and outside the signal handler without protection, i.e. atomic
load/store instructions. You permanently risk a broken state if
signals of different types meet each other in sh_fault()
- shp->trapnote & | = anything <-- what happens in nested signal
handler calls? Guess?
- Another notorious abuse:
   482                                  if(shp->siginfo)
   483                                  {
   484                                          do ip = shp->siginfo[sig];
   485                                          while
(asocasptr(&shp->siginfo[sig],ip,0)!=ip);
   486                                  }
How is this suppose to be signal safe? You read from shp->siginfo[sig]
unsafe but do the write using atomic instructions? Likewise filling
the queue must be done with atomic instructions to handle nested
signal handler calls.
- All variables read or written from both a signal handler and user
code must be of storage type volatile. Missing here, too.
Surprisingly, if you look at the assembler code, the compiler caches a
lot of values in registers or hidden temporary stack variables. May be
another source of trouble. May? Not may. Correction: It is.

>From what I can glance in 5 minutes sh_fault() is notoriously signal
unsafe by violating every known rule about writing signal handlers
(i.e. keep the code as short as possible, use only async signal safe
functions, use volatile storage and atomic r/w instructions for
communication) and should be thrown away. No amount of one line fixes
can rescue it. Please. Just throw this **** away and rewrite it from
scratch.

Irek
_______________________________________________
ast-developers mailing list
[email protected]
http://lists.research.att.com/mailman/listinfo/ast-developers

Re: [ast-developers] Hang because job management _within_ the SIGCHLD signal handler... / was: Re: AT&T Software Technology ast alpha software download update

Reply via email to