Re: Killing a zombie process?
	Hello. Did you mistype, did I misread, or did you really mean to say
that the parent pid (ppid) is 0 on the offending zombie process? That could
be a clue. The ppid should be 1, not 0. I wonder how, if that is the case,
the ppid of 0 gets assigned instead of 1?
-thanks
-Brian

On Sep 30, 3:55pm, Paul Goyette wrote:
} Subject: Re: Killing a zombie process?
} On Wed, 30 Sep 2015, Paul Goyette wrote:
}
} >> # kill -HUP 1
} >> # ps axl | grep ' Z '
} >> 0 27237 1 0 0 0 0 0 - Z pts/2- 0:00.00 (sh)
} >
} > Well, it happened again!
} >
} > I rebooted earlier today, and then deinstalled and rebuilt about 40
} > packages within the pkgsrc/sysutils/mksandbox environment (all with
} > MAKE_JOBS=3 enabled). After all packages were rebuilt, I exit from
} > the sandbox and run ./sandbox/dismount and get the error
} >
} > umount: /sandbox/bin: Device busy
} >
} > Sure enough, there's a new Zombie process, and its parent seems to be
} > init (PPID==0)
} >
} > # ps axl | grep ' Z '
} > 0 23848 28120 85 0 4360164 pipe_rd R+ pts/2 0:00.00 grep Z
} > 0 2543910 0 0 0 0 - Z pts/2 0:00.00 (sh)
} >
} > HUPing init still doesn't help.
} >
} > So, I'm pretty sure that there's a bug somewhere, but haven't a clue
} > on where to start looking.
}
} Interestingly, if I shutdown to single-user mode, the zombie process
} gets reaped and disappears!
}
} So there must be some difference in how init(8) waits during normal
} operation and how it waits during the transition to single-user.
}
} +------------------+--------------------------+------------------------+
} | Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:      |
} | (Retired)        | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com   |
} | Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org |
} +------------------+--------------------------+------------------------+
>-- End of excerpt from Paul Goyette
Re: Killing a zombie process?
On Wed, 30 Sep 2015, Paul Goyette wrote:

>> # kill -HUP 1
>> # ps axl | grep ' Z '
>> 0 27237 1 0 0 0 0 0 - Z pts/2- 0:00.00 (sh)
>
> Well, it happened again!
>
> I rebooted earlier today, and then deinstalled and rebuilt about 40
> packages within the pkgsrc/sysutils/mksandbox environment (all with
> MAKE_JOBS=3 enabled). After all packages were rebuilt, I exit from
> the sandbox and run ./sandbox/dismount and get the error
>
> umount: /sandbox/bin: Device busy
>
> Sure enough, there's a new Zombie process, and its parent seems to be
> init (PPID==0)
>
> # ps axl | grep ' Z '
> 0 23848 28120 85 0 4360164 pipe_rd R+ pts/2 0:00.00 grep Z
> 0 2543910 0 0 0 0 - Z pts/2 0:00.00 (sh)
>
> HUPing init still doesn't help.
>
> So, I'm pretty sure that there's a bug somewhere, but haven't a clue
> on where to start looking.

Interestingly, if I shutdown to single-user mode, the zombie process
gets reaped and disappears!

So there must be some difference in how init(8) waits during normal
operation and how it waits during the transition to single-user.

+------------------+--------------------------+------------------------+
| Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:      |
| (Retired)        | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com   |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org |
+------------------+--------------------------+------------------------+
Re: Killing a zombie process?
    Date:        Wed, 30 Sep 2015 15:55:04 +0800 (PHT)
    From:        Paul Goyette
    Message-ID:

  | So there must be some difference in how init(8) waits during normal
  | operation and how it waits during the transition to single-user.

Either that (which isn't really all that likely I'd guess) or perhaps the
process is not yet linked to init, so can't be waited upon. It needs to
be on init's child queue for wait to find it, regardless of what the ppid
has been set to.

I think I'd be checking out the sequence in the sys_exit() code, to see
if there's anything that happens, or could happen, between setting the
ppid to 1 and linking the process onto process 1's child list that could
perhaps block and cause the zombie to just sit there (for this, once the
process status is Z, you can't really trust some of the other ps output;
pid and ppid should be correct, but wchan is unlikely to have any meaning).

kre
Re: Killing a zombie process?
    Date:        Wed, 30 Sep 2015 18:29:20 +0800 (PHT)
    From:        Paul Goyette
    Message-ID:

  | Well, a quick read through sbin/init.c shows that sometimes it waits
  | with WNOHANG and sometimes it doesn't.

It is more that init reaps lots of zombie processes; missing just one of
them, occasionally, seems unlikely at best, whatever flags it gives wait().

Far more likely (IMO) is that the process in question is special somehow,
and the most likely special that would cause wait() to fail to see it is
if the process isn't on init's child process list. There might be other
possibilities, if the kernel wait code sometimes ignores zombie processes
for some other reason (some other resource still owned, or whatever).

  | Well, for the previous occurrence, I waited many hours, and the zombie
  | was still there. (It might even have been as much as a couple of days.)

Of course, it won't be time based where your shutdown just happened to
occur at the magic interval ... rather, shutdown will be causing some
other condition to occur (or be removed) which then allows the zombie
process to complete its transition into full zombiehood (???) and for
init to then clean it.

  | If I get really brave, I might even use gdb to attach to init(8) and see
  | which of the several waitpid() calls is active.

I think I'd start with the proc structure of the zombie itself, and see
if there's anything unusual about it, see if all the process's resources
(like its kernel stack) have truly been freed already, and if not, just
where that process is sitting. Since the zombie sits there essentially
forever (it seems) it ought to be fairly easy to check this just using gdb
on /dev/kmem without interrupting normal operations at all (ie: risk free).

On the other hand, checking init's child queue that way would be hard, as
it is in a constant state of churn.

kre
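
The point that wait() can only find processes on the caller's own child list has a userland analogue that is easy to try. The following is a minimal sketch, assuming a POSIX system; the helper name wait_for_non_child is made up for illustration, and it says nothing about how the kernel maintains its lists internally. It only shows that waitpid() refuses a pid that is not a child of the caller.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Try to wait for a process that is not our child (here: our own pid,
 * which can never be a child of itself).  Returns the errno value
 * reported by waitpid(), or 0 if waitpid() unexpectedly did not fail.
 * wait_for_non_child is a hypothetical name used only for this sketch.
 */
int wait_for_non_child(void)
{
	int status;

	errno = 0;
	if (waitpid(getpid(), &status, WNOHANG) == -1)
		return errno;	/* expected: ECHILD */
	return 0;
}
```

If the zombie in this thread really were missing from init's child list, init's own waitpid() calls would presumably be in the same position as this caller: the pid exists, but wait cannot see it.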
Re: Killing a zombie process?
On Wed, 30 Sep 2015, Robert Elz wrote:

>     Date:        Wed, 30 Sep 2015 15:55:04 +0800 (PHT)
>     From:        Paul Goyette
>     Message-ID:
>
>   | So there must be some difference in how init(8) waits during normal
>   | operation and how it waits during the transition to single-user.
>
> Either that (which isn't really all that likely I'd guess) ...

Well, a quick read through sbin/init.c shows that sometimes it waits
with WNOHANG and sometimes it doesn't. I haven't figured out the actual
code-flow yet, so I can't tell if this accounts for the steady-state vs
transition-to-single-user difference or not.

> ... or perhaps the process is not yet linked to init, so can't be
> waited upon. It needs to be on init's child queue for wait to find it,
> regardless of what the ppid has been set to.

Well, for the previous occurrence, I waited many hours, and the zombie
was still there. (It might even have been as much as a couple of days.)
In today's event, the 'shutdown' transition was run less than one hour
after the first notice, and at _that_ time the zombie was reaped.

It doesn't seem logical that the ppid gets set, but it gets enqueued
only after starting a shutdown.

> I think I'd be checking out the sequence in the sys_exit() code, to see
> if there's anything that happens, or could happen, between setting the
> ppid to 1 and linking the process onto process 1's child list that could
> perhaps block and cause the zombie to just sit there (for this, once the
> process status is Z, you can't really trust some of the other ps output;
> pid and ppid should be correct, but wchan is unlikely to have any meaning).

Yeah, I'll have a look at the sys_exit() code and see what I can find.

If I get really brave, I might even use gdb to attach to init(8) and see
which of the several waitpid() calls is active. (I'd prefer to do this
in a qemu VM, but then I'd need to reproduce the entire environment
inside the VM.)

+------------------+--------------------------+------------------------+
| Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:      |
| (Retired)        | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com   |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org |
+------------------+--------------------------+------------------------+
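
For readers who have not looked at a reaper loop before, here is a small sketch of what waiting with WNOHANG looks like from userland. It is not taken from init's source; the function name reap_one_child and the 1 ms polling interval are assumptions made for the example. Between the child's _exit() and the successful waitpid() the child is exactly the kind of 'Z' entry shown in the ps output above.

```c
#include <poll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that exits immediately, then poll for it with
 * waitpid(..., WNOHANG).  Returns the child's exit status, or -1 on
 * error.  reap_one_child is a hypothetical helper for this sketch.
 */
int reap_one_child(int exit_code)
{
	pid_t pid, r;
	int status;

	pid = fork();
	if (pid == -1)
		return -1;
	if (pid == 0)
		_exit(exit_code);	/* child: a zombie until the parent waits */

	/* WNOHANG makes waitpid() return 0 while the child has not exited yet. */
	while ((r = waitpid(pid, &status, WNOHANG)) == 0)
		(void)poll(NULL, 0, 1);	/* sleep about 1 ms between polls */

	if (r != pid || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);	/* the zombie is gone at this point */
}
```

A blocking waitpid() without WNOHANG would collect the same child; the flag only changes whether the caller sleeps in the kernel or returns immediately when nothing is ready.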
Re: kqueue: SIGIO?
> On the other hand, if kernel changes would be needed (for example to
> make SIGIO work with kqueue() on NetBSD) then we really should
> evaluate whether or not there is a better change that could be made
> to handle the situation, rather than just blindly making NetBSD the
> same as linux. What that might be though I have no idea.

The first thing that comes to mind is a syscall that tells the kernel
to deliver signals, or at least certain signals, by changing a memory
location rather than arranging to execute code. (I have trouble
imagining an architecture on which checking a volatile int variable is
more expensive than a syscall into the kernel.)

It is true, though, that that's more-than-zero cost in the loop. But
it might be close enough to zero to be acceptable.

/~\ The ASCII				  Mouse
\ / Ribbon Campaign
 X  Against HTML		mo...@rodents-montreal.org
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
Re: kqueue: SIGIO?
On Wed, Sep 30, 2015 at 12:30:36AM +0200, Joerg Sonnenberger wrote:
> On Tue, Sep 29, 2015 at 10:09:51PM +0200, Rhialto wrote:
> > On Tue 29 Sep 2015 at 13:22:08 +0200, Tobias Nygren wrote:
> > > Here is the relevant bit of the talk if you are curious:
> > >
> > > https://www.youtube.com/watch?v=t400SmZlnO8=youtu.be=1888
> >
> > So he wants a signal when a message is available in a kqueue, in other
> > words, can be read with kevent(2).
>
> Why oh why. I thought the X server finally got rid of the
> overcomplicated signal handlers. If there is any kind of load going on,
> the signal sending is more costly than occasionally querying the kqueue
> for (other active) entries. If there is no load, it doesn't make a
> difference.

What he said. I've owned some fairly performance-critical single
threaded event driven code in my time, and it is my opinion that trying
to use signals to achieve client fairness is almost always a huge mistake.

The trick is almost always to structure the event loop so that checking
for work from another client is nearly zero-cost, and scales much less
than linearly with the number of clients. Given the very high cost of
handling a signal, it is pretty darned hard to do worse.

The way to get yourself in trouble is to chase false "optimizations"
involving processing-to-completion of too much work from a single client
at once, or shortcuts involving handing off data directly from one client
to another. The latter, at least, are really classic priority inversion
bugs in disguise.

In practice, the X server has a shared memory transport to most clients
and a shared memory interface to the display hardware; it should seldom
have syscalls to do. Arranging for nearly zero-cost "look aside" at some
other *properly designed and structured* shared memory source of client
requests should be pretty easy.

Does the problem actually have to do with the mouse and keyboard?
Mouse's idea of having the kernel write a flag word instead of
interrupting the process seems like a very nice fit if so.

--
  Thor Lancelot Simon                                  t...@panix.com

  "We cannot usually in social life pursue a single value or a single moral
   aim, untroubled by the need to compromise with others." - H.L.A. Hart
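
The event-loop structure described above can be sketched roughly as follows. This is only an illustration under stated assumptions: struct client, BUDGET and service_pass are names invented here, and the pending counter stands in for whatever shared-memory request queue a real server would inspect. Note that this simple version still scans every client on each pass; getting the "much less than linearly" scaling would need something like a ready list or bitmap, which is not shown.

```c
#include <stddef.h>

#define BUDGET 4	/* hypothetical cap on requests per client per pass */

struct client {
	unsigned int pending;	/* queued requests (stand-in for a shared ring) */
	unsigned int served;	/* requests handled so far */
};

/*
 * One pass over all clients.  Looking at an idle client costs one load
 * and one compare; a busy client gets at most BUDGET requests handled
 * before the loop moves on, so it cannot starve the others.
 * Returns the number of requests handled in this pass.
 */
unsigned int service_pass(struct client *c, size_t n)
{
	unsigned int handled = 0;

	for (size_t i = 0; i < n; i++) {
		unsigned int quota = BUDGET;

		while (c[i].pending > 0 && quota > 0) {
			c[i].pending--;		/* "handle" one request */
			c[i].served++;
			quota--;
			handled++;
		}
	}
	return handled;
}
```

With one client holding 10 queued requests and another holding 1, the first pass handles 4 and 1 respectively, so the light client is served after a bounded delay instead of waiting for the heavy one to drain.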
Re: Killing a zombie process?
On Wed, 30 Sep 2015, Brian Buhrow wrote:

> 	Hello. Did you mistype, did I misread or did you really mean to say
> that the parent pid (ppid) is 0 on the offending zombie process? that could
> be a clue. The ppid should be 1, not 0. I wonder how, if that is the case,
> the ppid of 0 gets assigned instead of 1?

It's a typo. The parent is init, PPID==1

UID   PID  PPID CPU PRI NI VSZ RSS WCHAN STAT TTY    TIME    COMMAND
  0 27237     1   0   0  0   0   0 -     Z    pts/2- 0:00.00 (sh)
            ^^^

> -thanks
> -Brian
>
> On Sep 30, 3:55pm, Paul Goyette wrote:
> } Subject: Re: Killing a zombie process?
> } On Wed, 30 Sep 2015, Paul Goyette wrote:
> }
> } >> # kill -HUP 1
> } >> # ps axl | grep ' Z '
> } >> 0 27237 1 0 0 0 0 0 - Z pts/2- 0:00.00 (sh)
> } >
> } > Well, it happened again!
> } >
> } > I rebooted earlier today, and then deinstalled and rebuilt about 40
> } > packages within the pkgsrc/sysutils/mksandbox environment (all with
> } > MAKE_JOBS=3 enabled). After all packages were rebuilt, I exit from
> } > the sandbox and run ./sandbox/dismount and get the error
> } >
> } > umount: /sandbox/bin: Device busy
> } >
> } > Sure enough, there's a new Zombie process, and its parent seems to be
> } > init (PPID==0)
> } >
> } > # ps axl | grep ' Z '
> } > 0 23848 28120 85 0 4360164 pipe_rd R+ pts/2 0:00.00 grep Z
> } > 0 2543910 0 0 0 0 - Z pts/2 0:00.00 (sh)
> } >
> } > HUPing init still doesn't help.
> } >
> } > So, I'm pretty sure that there's a bug somewhere, but haven't a clue
> } > on where to start looking.
> }
> } Interestingly, if I shutdown to single-user mode, the zombie process
> } gets reaped and disappears!
> }
> } So there must be some difference in how init(8) waits during normal
> } operation and how it waits during the transition to single-user.
> }
> } +------------------+--------------------------+------------------------+
> } | Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:      |
> } | (Retired)        | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com   |
> } | Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org |
> } +------------------+--------------------------+------------------------+
> -- End of excerpt from Paul Goyette

+------------------+--------------------------+------------------------+
| Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:      |
| (Retired)        | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com   |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org |
+------------------+--------------------------+------------------------+
Re: pkgsrc-2015Q3 released
"Thomas Mueller" writes:

> Now that pkgsrc-wip has been moved to a git repository, how does a user who
> already has pkgsrc-wip by cvs update?
>
> I checked the URL, http://pkgsrc.org/wip/ , and this was not discussed.
>
> Or does the user just delete or move the cvs repository and git clone, fresh
> start?

Basically yes. However, you may want to do a final update of the tree
from sourceforge and verify you have no uncommitted changes that you
want to keep. (If so, you will have to manage them manually.)
Re: pkgsrc-2015Q3 released
Now that pkgsrc-wip has been moved to a git repository, how does a user who already has pkgsrc-wip by cvs update? I checked the URL, http://pkgsrc.org/wip/ , and this was not discussed. Or does the user just delete or move the cvs repository and git clone, fresh start? Tom
Re: kqueue: SIGIO?
    Date:        Wed, 30 Sep 2015 09:45:32 -0400
    From:        Thor Lancelot Simon
    Message-ID:  <20150930134532.ga25...@panix.com>

  | Does the problem actually have to do with the mouse and keyboard?

The server also needs to deal with (potential) network connections from
clients - most people these days might only run clients on the same
system as the server, and so can use shared mem, but not everyone is so
limited (I know I run across-net connections, even if it is just from a
xen DomU client to the X server running on the Dom0 - but I also do real
over ethernet/wireless X connections too on occasion).

Those connections will never be the high performance kind, but nor should
they be starved by some other local high performance shared-mem using
local client.

  | Mouse's idea of having the kernel
  | write a flag word instead of interrupting the process seems like a
  | very nice fit if so.

It also fits with the only safe thing that's really possible to do in a
signal handler being to set a variable and return (or exit the process)
(ie: the main loop has to check a variable anyway, whether signal delivery
is traditional, or via Mouse's suggested mechanism).

The issue with it is how one would ever safely clear the variable again,
while avoiding race conditions - when a signal handler sets the variable,
it is all user code, and can use locking to be safe; one cannot lock out
the kernel though. But maybe, given this is supposed to be rare, a sys call
to clear the var, after detecting it set, would be acceptable - or just
switch to a different var for subsequent notifications using the original
sys call, after which the first one is just a variable again, and can be
cleared normally (though that would require an indirect reference to check
it, and so greater cost for that.)

kre
Re: kqueue: SIGIO?
On Wed, Sep 30, 2015 at 07:37:10AM -0400, Mouse wrote:
> > On the other hand, if kernel changes would be needed (for example to
> > make SIGIO work with kqueue() on NetBSD) then we really should
> > evaluate whether or not there is a better change that could be made
> > to handle the situation, rather than just blindly making NetBSD the
> > same as linux. What that might be though I have no idea.
>
> The first thing that comes to mind is a syscall that tells the kernel
> to deliver signals, or at least certain signals, by changing a memory
> location rather than arranging to execute code. (I have trouble
> imagining an architecture on which checking a volatile int variable is
> more expensive than a syscall into the kernel.)

Well, you can easily get that by just running a second thread that does
nothing but monitor the kqueue and deliver notification to the main
thread. That's a pretty standard design, and all the OpenGL code has
likely put at least one other thread into the X server anyway.

Joerg
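
A rough sketch of the second-thread design, under stated assumptions: the watcher blocks on a descriptor and bumps an atomic counter that the main loop can read cheaply. To keep the example runnable on systems without kqueue, a blocking read() on a pipe stands in for the kevent() call; the names events_seen, watcher, start_watcher and events_so_far are invented here.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <unistd.h>

static atomic_uint events_seen;	/* bumped by the watcher, polled by the main loop */

/*
 * Watcher thread: block until the descriptor delivers something, then
 * bump the counter.  With kqueue the blocking call would be kevent();
 * a plain read() on a pipe is used here as a stand-in (assumption).
 */
static void *watcher(void *arg)
{
	int fd = *(int *)arg;
	char byte;

	while (read(fd, &byte, 1) == 1)
		atomic_fetch_add(&events_seen, 1);
	return NULL;	/* EOF or error: stop watching */
}

/* Start the watcher on *fd; returns pthread_create()'s result (0 on success). */
int start_watcher(pthread_t *tid, int *fd)
{
	return pthread_create(tid, NULL, watcher, fd);
}

/* Cheap check for the main loop: a single atomic load, no syscall. */
unsigned int events_so_far(void)
{
	return atomic_load(&events_seen);
}
```

The main thread pays only for an atomic load per loop iteration, while the cost of the blocking syscall is moved to the watcher thread.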
Re: kqueue: SIGIO?
>> Mouse's idea of having the kernel write a flag word instead of
>> interrupting the process seems like a very nice fit if so.
> The issue with it is how one would ever safely clear the variable
> again, [...]

This is not difficult: you do it by not clearing the variable.

For the sake of argument and brevity, let us suppose a suitable type
for the variable in question is unsigned int. Then the kernel, instead
of _setting_ the variable, can _increment_ the variable, and userland
can do something like

	volatile unsigned int sigflag;
	unsigned int chksigflag;
	unsigned int lastsigflag;

	sigflag = 0;
	lastsigflag = 0;
	handle_via_flag_variable(SIGIO,); // flag to sigaction()?
	while (1) { // main loop
		...
		chksigflag = sigflag;
		if (chksigflag != lastsigflag) {
			lastsigflag = chksigflag;
			...handle the signal...
		}
		...
	}

Obviously, there is still a potential issue; if the kernel delivers
exactly k*2^32 (for integer k, assuming 32-bit unsigned ints) signals,
between userland checks, userland will miss them. I don't consider
this a big enough risk to worry about; if it really bothers you, make
it long long int instead - there is some risk of value tearing in the
read on many architectures, but, since the kernel's increment is atomic
with respect to userland, the worst it will do is delay noticing the
signal by one trip around the loop.

/~\ The ASCII				  Mouse
\ / Ribbon Campaign
 X  Against HTML		mo...@rodents-montreal.org
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
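
The increment-instead-of-set scheme can be tried out in plain userland by faking the kernel side. The sketch below is a simulation only: fake_kernel_deliver stands in for the hypothetical kernel mechanism (no such syscall exists in the code shown), and poll_sigflag is an invented name for the main-loop check. It returns the difference since the last check rather than a yes/no answer, which is a small extension of the comparison in the message above.

```c
static volatile unsigned int sigflag;	/* would be incremented by the kernel */
static unsigned int lastsigflag;	/* last value the main loop acted on */

/* Stand-in for the hypothetical kernel-side delivery: bump the counter. */
void fake_kernel_deliver(void)
{
	sigflag = sigflag + 1;
}

/*
 * Main-loop check.  Returns how many deliveries happened since the last
 * call (modulo 2^N for an N-bit unsigned int) and remembers the new
 * value.  Userland never writes to sigflag, so there is no race between
 * clearing and setting.
 */
unsigned int poll_sigflag(void)
{
	unsigned int chk = sigflag;		/* one read of the shared word */
	unsigned int delta = chk - lastsigflag;	/* unsigned wraparound is well defined */

	lastsigflag = chk;
	return delta;
}
```

A nonzero return means "handle the signal"; the only missed case is the one already noted, where the counter advances by an exact multiple of 2^N between two checks.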