On Fri, 28 Sep 2012 10:21:49 +0200 Cedric Blancher wrote:
> On 28 September 2012 07:44, Glenn Fowler <g...@research.att.com> wrote:
> >
> > { INIT ast-ksh } 2012-09-27 alphas posted to
> >         www.research.att.com/sw/download/alpha/

> We experience a lot of failures with ast-ksh 2012-09-27 on Suse 12.2
> Linux and latest Fedora:

> test arith begins at 2012-09-28+08:51:50
>         arith.sh[420]: compound var arithmetic failed
>         arith.sh[421]: compound var arithmetic failed
>         arith.sh[422]: compound var arithmetic failed
>         arith.sh[423]: compound var arithmetic failed
>         arith.sh[424]: compound var arithmetic failed
>         arith.sh[425]: compound var arithmetic failed
>         arith.sh[426]: compound var arithmetic failed
> test arith failed at 2012-09-28+08:51:50 with exit code 1 [ 201 tests 1 error ]
> test attributes begins at 2012-09-28+09:19:32
>         attributes.sh[128]: attributes not cleared for script execution
>         attributes.sh[133]: typeset -L should not be inherited
> test attributes failed at 2012-09-28+09:19:34 with exit code 1 [ 110
> tests 1 error ]
> test attributes(shcomp) begins at 2012-09-28+09:19:34
>         shcomp-attributes.ksh[128]: attributes not cleared for script execution
>         shcomp-attributes.ksh[133]: typeset -L should not be inherited
> test attributes(shcomp) failed at 2012-09-28+09:19:36 with exit code 2
> [ 110 tests 2 errors ]
> test basic begins at 2012-09-28+09:19:36
>         basic.sh[165]: script not working
>         basic.sh[171]: output file pointer not shared correctly
>         basic.sh[198]: builtin replaces standard input pipe
>         basic.sh[204]: $0 not correct for . script
>         basic.sh[211]: nested scripts failed
>         basic.sh[215]: scripts in subshells fail
>         basic.sh[350]: piping into script fails
>         basic.sh[359]: script pipe to shell fails
> blabla

> We've traced this down to the nonconforming glibc/Linux implementation
> of posix_spawn() - disabling it cures the problem on Linux. I
> crosschecked with the AIX build - it uses posix_spawn() the same way
> but without triggering any failures.
> I think this is a follow-up to
> http://marc.info/?l=ast-developers&m=134785274012526&w=2 - I can't
> agree with the assertion of Redhat's Michal Hlavinka that glibc
> posix_spawn() is right, because the current behaviour is IMO useless
> for use in a shell (hence the failures in the testsuite), and think a
> fix in glibc is still required.

to recap:

        grep _lib_posix_spawn arch/*/src/lib/libast/FEATURE/lib

there are 3 possible results
(1) not there => posix_spawn() unusable
(2) #define _lib_posix_spawn 2 => works with no workarounds
(3) #define _lib_posix_spawn 1 => works but posix_spawn() on an executable
    file that would fail with ENOEXEC via execve() creates a process
    that exits with status 127

our sol10.* systems have _lib_posix_spawn 1 and they work
so something else is going on
(we don't have a linux system with the new glibc posix_spawn())
it may be a timing problem with this logic in
        src/lib/libast/misc/spawnvex.c
(spawnvex() is new and the api has not settled yet)

#if _lib_posix_spawn < 2
        if (waitpid(pid, &err, WNOHANG|WNOWAIT) == pid && EXIT_STATUS(err) == 127)
        {
                while (waitpid(pid, NiL, 0) == -1 && errno == EINTR);
                if (!access(path, X_OK))
                        errno = ENOEXEC;
                pid = -1;
        }
#endif

can you do an strace and see what the waitpid() is returning?

my guess is on solaris the child process has exited 127 on ENOEXEC
before the waitpid(pid, &err, WNOHANG|WNOWAIT) and on linux
the process has not yet exited (but looking at build log over the
last week I see some spurious exit code 127 failures on solaris,
so it looks like a timing problem even for solaris)

the standard allows posix_spawn() to report an exec failure the way
fork()/exec() would: as a child process that exits with status 127
if ENOEXEC produces a child process that will eventually exit 127,
I'm beginning to fear that there is no way to work around the timing window
a sleep() before waitpid() would be dumb and not guaranteed to work anyway
the posix_spawn() wrapper could check the magic number, but I don't want
to get into the magic number game; that's exec*()'s job
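
For illustration only, this is roughly what "checking the magic number" would mean. has_exec_magic() is a hypothetical helper, not part of libast, and it assumes "#!" and ELF are the only formats that matter. That assumption is exactly the game referred to above, since the kernel may accept other formats.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * has_exec_magic() is a hypothetical helper, not part of libast.
 * Returns 1 if the file starts with "#!" or the ELF magic, 0 if it
 * does not, and -1 if the file cannot be opened or read.
 * Assumption: "#!" and ELF are the only formats of interest, which is
 * not true in general (other binary formats exist on some systems).
 */
static int has_exec_magic(const char *path)
{
    unsigned char buf[4];
    ssize_t n;
    int fd;

    assert(path != NULL);
    if ((fd = open(path, O_RDONLY)) < 0)
        return -1;
    do {
        n = read(fd, buf, sizeof(buf));   /* retry if interrupted */
    } while (n < 0 && errno == EINTR);
    close(fd);
    if (n < 0)
        return -1;
    if (n >= 2 && buf[0] == '#' && buf[1] == '!')
        return 1;                         /* interpreter script */
    if (n >= 4 && memcmp(buf, "\177ELF", 4) == 0)
        return 1;                         /* ELF binary */
    return 0;                             /* execve() would likely fail with ENOEXEC */
}
```

A wrapper would call this before posix_spawn() and set errno to ENOEXEC itself when it returns 0, duplicating a decision that exec*() already makes.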

so if it is a timing window, the iffe test will have to fail
posix_spawn() implementations that create a child process for ENOEXEC
and if that's the case it shows how useless posix_spawn() is
because the caller only knows exit status 127, not the root of the problem

in the case of the shell calling posix_spawn() it must know the reason for failure
ENOEXEC means the shell can attempt to treat the executable as a script
not so for exit code 127

I just noticed that this code is not strictly portable because it
relies on the non-standard linux WNOWAIT without iffe-ing or #ifdef-ing it
for now it's ok (by luck) because _lib_posix_spawn=1 only on { linux solaris }
I'll modify the iffe test to only emit _lib_posix_spawn=1 if WNOWAIT is defined
otherwise posix_spawn() is useless because the spawnvex() wrapper must
not interfere with the caller's ability to wait on the spawned process
(but if we never get _lib_posix_spawn=1 this observation is moot)
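
One possible way to peek at a child's exit status without reaping it is waitid(), since POSIX lists WNOWAIT among the waitid() options but not among the waitpid() options. The sketch below is an assumption about how such a peek could look. peek_exit_status() is a hypothetical name, not the spawnvex() implementation, and it does not close the timing window discussed above: a child that has not exited yet is still reported as "no status".

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700
#endif
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * peek_exit_status() is a hypothetical helper, not part of libast.
 * Returns 1 and stores the exit code in *code if pid has already exited
 * normally, 0 if no such status is available yet (still running, or
 * killed by a signal), and -1 on error.
 * WNOWAIT leaves the child waitable, so the caller can still reap it.
 */
static int peek_exit_status(pid_t pid, int *code)
{
    siginfo_t si;
    int r;

    assert(code != NULL);
    memset(&si, 0, sizeof(si));     /* si_pid stays 0 if nothing is reported */
    do {
        r = waitid(P_PID, (id_t)pid, &si, WEXITED | WNOHANG | WNOWAIT);
    } while (r == -1 && errno == EINTR);
    if (r == -1)
        return -1;
    if (si.si_pid != pid || si.si_code != CLD_EXITED)
        return 0;
    *code = si.si_status;           /* exit code, e.g. 127 */
    return 1;
}
```

A wrapper could call this right after posix_spawn() and treat a result of 1 with code 127 as a failed exec, leaving every other case to the caller's own wait.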

_______________________________________________
ast-developers mailing list
ast-developers@research.att.com
https://mailman.research.att.com/mailman/listinfo/ast-developers
