Based on this analysis by Frédéric Bonnard (thanks a lot!) I infer that the "sleep 5" does not fix the problem at all, so I'm going to disable the test completely for buster, as it's clearly buggy.
----- Forwarded message from Frédéric Bonnard <fre...@debian.org> ----- Date: Wed, 27 Feb 2019 16:48:38 +0100 From: Frédéric Bonnard <fre...@debian.org> To: Breno Leitao <lei...@debian.org>, Santiago Vila <sanv...@unex.es>, ppc6...@buildd.debian.org Subject: Re: diffutils test suite Hi, February 18, 2019 7:04 PM, "Breno Leitao" <lei...@debian.org> wrote: > hi Santiago, > > On 2/18/19 2:46 PM, Santiago Vila wrote: > >> Hello. >> >> I have a problem with diffutils testsuite. >> >> There is a test called "colors" which fails in ppc64el. >> >> The first time this happened I added a "sleep 1" to the test >> to avoid what it seemed to be a race condition. Now I have increased >> the time to 5 seconds, but it still fails on powerpc and ppc64. >> >> Are those architectures so slow that they need more than 5 seconds for >> a fifo to work? > > I do not think so. In fact, ppc64el seems to be the fastest builder at debian > today. To summarize the context, the failing test is : -- mkfifo fifo printf '%1000000s-a' > a printf '%1000000s-b' > b head -c 10 < fifo > /dev/null & sleep 5 diff --color=always ---presume-output-tty a b > fifo test $? = 141 || fail=1 -- The expected behaviour of diff is that is should catch the SIGPIPE when head has read its 10bytes from the fifo given the big a and b file, and return 141 (141 = 128 + SIGPIPE) Quite often 141 is returned. I straced a case where diff returns 2 (EXIT_TROUBLE) (that is our failing case) : -- 24846 write(1, " "..., 4096) = 4096 24846 write(1, " "..., 4096) = -1 EPIPE (Broken pipe) 24846 --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=24846, si_uid=1000} --- 24846 rt_sigreturn({mask=[]}) = -1 EPIPE (Broken pipe) 24846 munmap(0x7fffa0cf0000, 262144) = 0 24846 munmap(0x7fffa0fa0000, 262144) = 0 24846 munmap(0x7fffa0cb0000, 262144) = 0 24846 munmap(0x7fffa0f60000, 262144) = 0 24846 munmap(0x7fffa10e0000, 1048576) = 0 24846 munmap(0x7fffa0fe0000, 1048576) = 0 24846 close(3) = 0 24846 close(4) = 0 24846 write(1, "\33[0m\n\\ No newline at end of file"..., 33) = -1 EPIPE (Broken pipe) 24846 --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=24846, si_uid=1000} --- 24846 rt_sigreturn({mask=[]}) = -1 EPIPE (Broken pipe) 24846 write(2, "/build/diffutils-nPN2r4/diffutil"..., 48) = 48 24846 write(2, "standard output", 15) = 15 24846 write(2, ": Broken pipe", 13) = 13 24846 write(2, "\n", 1) = 1 24846 exit_group(2) = ? 24846 +++ exited with 2 +++ 24845 <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 2}], 0, NULL) = 24846 24845 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=24846, si_uid=1000, si_status=2, si_utime=0, si_stime=0} --- 24845 rt_sigreturn({mask=[]}) = 24846 24845 dup2(10, 1) = 1 24845 close(10) = 0 24845 exit_group(2) = ? 24845 +++ exited with 2 +++ -- We can see that diff reports "Broken pipe" with the standard output. and here is the working case : --- 24987 write(1, " "..., 4096) = 4096 24987 write(1, " "..., 4096) = -1 EPIPE (Broken pipe) 24987 --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=24987, si_uid=1000} --- 24987 rt_sigreturn({mask=[]}) = -1 EPIPE (Broken pipe) 24987 write(1, "\33[0m\n\\ No newline at end of file"..., 56) = -1 EPIPE (Broken pipe) 24987 --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=24987, si_uid=1000} --- 24987 rt_sigreturn({mask=[]}) = -1 EPIPE (Broken pipe) 24987 rt_sigprocmask(SIG_BLOCK, [HUP INT QUIT PIPE ALRM TERM TSTP XCPU XFSZ VTALRM PROF IO], [], 8) = 0 24987 rt_sigaction(SIGPIPE, {sa_handler=SIG_DFL, sa_mask=[PIPE], sa_flags=SA_RESTART}, {sa_handler=0x12ff80780, sa_mask=[HUP INT QUIT PIPE ALRM TERM TSTP XCPU XFSZ VTALRM PROF IO], sa_flags=SA_RESTART}, 8) = 0 24987 rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [HUP INT QUIT PIPE ALRM TERM TSTP XCPU XFSZ VTALRM PROF IO], 8) = 0 24987 getpid() = 24987 24987 gettid() = 24987 24987 tgkill(24987, 24987, SIGPIPE) = 0 24987 rt_sigprocmask(SIG_SETMASK, [HUP INT QUIT PIPE ALRM TERM TSTP XCPU XFSZ VTALRM PROF IO], NULL, 8) = 0 24987 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 24987 --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_TKILL, si_pid=24987, si_uid=1000} --- 24987 +++ killed by SIGPIPE +++ 24986 <... wait4 resumed> [{WIFSIGNALED(s) && WTERMSIG(s) == SIGPIPE}], 0, NULL) = 24987 24986 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=24987, si_uid=1000, si_status=SIGPIPE, si_utime=0, si_stime=0} --- 24986 rt_sigreturn({mask=[]}) = 24987 24986 dup2(10, 1) = 1 24986 close(10) = 0 24986 exit_group(141) = ? 24986 +++ exited with 141 +++ --- So, when the test fails, we actually got some SIGPIPE but we returned 2. We also see that diff already finished (it closes the a and b file) (src/diff.c end of compare_files() ) Reading the code, I see that the signals are catched and processed later by calling process_signals() periodically and that function may raise the caught signal and we then exit with 141. This way we can exit the process 141 in process_signals() or in other functions such as pfatal_with_name() with 2 and "Broken pipe" message (when a system call returns non-zero status and that is fatal). In our case, the test expects diff to exit in process_signals() but not in pfatal_with_name() which happens sometimes because we terminate the processing *after* the last check of pending signals i.e. process_signals() has occured, that is with exit code EXIT_TROUBLE and errno 13 and pfatal_with_name() exits and displays the error message. In both case, we trap the SIGPIPE problem, but it's reported differently as we exit in different functions. There is a window large enough in which a signal can occur and won't be processed by process_signals. Maybe the code should check before exiting that there are no pending signals to reduce that window. Or it should process the signals in a whole different manner. I wrote a small patch but I'm not sure this is the good way to handle the situation, but that works. F. >> The current version of diffutils (1:3.7-2) now seems to build on every >> release arch, but I believe that more because of pure chance than >> because the patch I added does anything useful at all. >> >> I'm not going to touch the package until it reaches testing (ten days), >> but after that I will welcome any help to fix this issue. > > Fred, would you please help Santiago with this problem please? F. --- a/src/diff.h +++ b/src/diff.h @@ -390,6 +390,7 @@ extern void begin_output (void); extern void debug_script (struct change *); extern void fatal (char const *) __attribute__((noreturn)); +extern void final_process_signals (void); extern void finish_output (void); extern void message (char const *, char const *, char const *); extern void message5 (char const *, char const *, char const *, --- a/src/util.c +++ b/src/util.c @@ -1574,3 +1574,14 @@ fflush (stderr); } + +/* Process remaining signals once before exit */ +void +final_process_signals (void) +{ + static int last = 1; + if (last) { + process_signals (); + last = 0; + } +} --- a/src/diff.c +++ b/src/diff.c @@ -1451,6 +1451,8 @@ } } + final_process_signals (); + /* Now the comparison has been done, if no error prevented it, and STATUS is the value this function will return. */ ----- End forwarded message -----