Re: Killing a zombie process?

2015-09-30 Thread Brian Buhrow
Hello.  Did you mistype, did I misread or did you really mean to say
that the parent pid (ppid) is 0 on the offending zombie process?  That
could be a clue.  The ppid should be 1, not 0.  I wonder how, if that is
the case, the ppid of 0 gets assigned instead of 1?
-thanks
-Brian

On Sep 30,  3:55pm, Paul Goyette wrote:
} Subject: Re: Killing a zombie process?
} On Wed, 30 Sep 2015, Paul Goyette wrote:
} 
} >> # kill -HUP 1
} >> # ps axl | grep ' Z '
} >>   0 27237 1 0   0  0   0  0 -   Z    pts/2- 0:00.00 (sh)
} >
} > Well, it happened again!
} >
} > I rebooted earlier today, and then deinstalled and rebuilt about 40
} > packages within the pkgsrc/sysutils/mksandbox environment (all with
} > MAKE_JOBS=3 enabled).  After all packages were rebuilt, I exit from
} > the sandbox and run ./sandbox/dismount and get the error
} >
} > umount: /sandbox/bin: Device busy
} >
} > Sure enough, there's a new Zombie process, and its parent seems to be
} > init  (PPID==0)
} >
} > # ps axl | grep ' Z '
} >    0 23848 28120  85  0   4360164 pipe_rd R+   pts/2  0:00.00 grep  Z
} >    0 25439     1   0  0  0  0  0 -       Z    pts/2  0:00.00 (sh)
} >
} > HUPing init still doesn't help.
} >
} > So, I'm pretty sure that there's a bug somewhere, but haven't a clue
} > on where  to start looking.
} 
} Interestingly, if I shut down to single-user mode, the zombie process 
} gets reaped and disappears!
} 
} So there must be some difference in how init(8) waits during normal 
} operation and how it waits during the transition to single-user.
} 
} 
} 
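} +------------------+--------------------------+------------------------+
} | Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:      |
} | (Retired)        | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com   |
} | Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org |
} +------------------+--------------------------+------------------------+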
>-- End of excerpt from Paul Goyette




Re: Killing a zombie process?

2015-09-30 Thread Paul Goyette

On Wed, 30 Sep 2015, Paul Goyette wrote:


# kill -HUP 1
# ps axl | grep ' Z '
  0 27237 1 0   0  0   0  0 -   Z    pts/2- 0:00.00 (sh)


Well, it happened again!

I rebooted earlier today, and then deinstalled and rebuilt about 40
packages within the pkgsrc/sysutils/mksandbox environment (all with
MAKE_JOBS=3 enabled).  After all packages were rebuilt, I exit from
the sandbox and run ./sandbox/dismount and get the error

umount: /sandbox/bin: Device busy

Sure enough, there's a new Zombie process, and its parent seems to be
init  (PPID==0)

# ps axl | grep ' Z '
   0 23848 28120  85  0   4360164 pipe_rd R+   pts/2  0:00.00 grep  Z
   0 25439     1   0  0  0  0  0 -       Z    pts/2  0:00.00 (sh)


HUPing init still doesn't help.

So, I'm pretty sure that there's a bug somewhere, but haven't a clue
on where  to start looking.


Interestingly, if I shut down to single-user mode, the zombie process 
gets reaped and disappears!


So there must be some difference in how init(8) waits during normal 
operation and how it waits during the transition to single-user.




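+------------------+--------------------------+------------------------+
| Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:      |
| (Retired)        | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com   |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org |
+------------------+--------------------------+------------------------+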


Re: Killing a zombie process?

2015-09-30 Thread Robert Elz
Date: Wed, 30 Sep 2015 15:55:04 +0800 (PHT)
From: Paul Goyette
Message-ID:  

  | So there must be some difference in how init(8) waits during normal 
  | operation and how it waits during the transition to single-user.

Either that (which isn't really all that likely I'd guess) or perhaps
the process is not yet linked to init, so can't be waited upon.   It
needs to be on init's child queue for wait to find it, regardless of
what the ppid has been set to.

I think I'd be checking out the sequence in the sys_exit() code, to see if
there's anything that happens, or could happen, between setting the ppid to 1
and linking the process onto process 1's child list that could perhaps block
and cause the zombie to just sit there (for this, once the process status
is Z, you can't really trust some of the other ps output; pid and ppid
should be correct, but wchan is unlikely to have any meaning).

kre



Re: Killing a zombie process?

2015-09-30 Thread Robert Elz
Date: Wed, 30 Sep 2015 18:29:20 +0800 (PHT)
From: Paul Goyette
Message-ID:  

  | Well, a quick read through sbin/init.c shows that sometimes it waits 
  | with WNOHANG and sometimes it doesn't.

It is more that init reaps lots of zombie processes; missing just one of
them, occasionally, seems unlikely at best, whatever flags it gives wait().

Far more likely (IMO) is that the process in question is special somehow,
and the most likely special that would cause wait() to fail to see it, is
if the process isn't on init's child process list.   There might be
other possibilities, if the kernel wait code sometimes ignores zombie
processes for some other reason (some other resource still owned, or whatever).

  | Well, for the previous occurrence, I waited many hours, and the zombie 
  | was still there.  (It might even have been as much as a couple of days.)

Of course, it won't be time based where your shutdown just happened to
occur at the magic interval ... rather, shutdown will be causing some
other condition to occur (or be removed) which then allows the zombie
process to complete its transition into full zombiehood (???) and for
init to then clean it.

  | If I get really brave, I might even use gdb to attach to init(8) and see 
  | which of the several waitpid() calls is active.

I think I'd start with the proc structure of the zombie itself, and see
if there's anything unusual about it, see if all the process's resources
(like its kernel stack) have truly been freed already, and if not, just where
that process is sitting.   Since the zombie sits there essentially
forever (it seems) it ought to be fairly easy to check this just using
gdb on /dev/kmem without interrupting normal operations at all (ie: risk free).

On the other hand, checking init's child queue that way would be hard, as it
is in a constant state of churn.

kre



Re: Killing a zombie process?

2015-09-30 Thread Paul Goyette

On Wed, 30 Sep 2015, Robert Elz wrote:


   Date:Wed, 30 Sep 2015 15:55:04 +0800 (PHT)
   From:Paul Goyette 
   Message-ID:  

 | So there must be some difference in how init(8) waits during normal
 | operation and how it waits during the transition to single-user.

Either that (which isn't really all that likely I'd guess) ...


Well, a quick read through sbin/init.c shows that sometimes it waits 
with WNOHANG and sometimes it doesn't.  I haven't figured out the actual 
code-flow yet, so I can't tell if this accounts for the steady-state vs 
transition-to-single-user difference or not.
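
The difference between the two ways of waiting can be sketched as follows. This is only a minimal illustration of waitpid(2) with and without WNOHANG, not code taken from init; the helper names spawn_short_lived, reap_nonblocking and reap_blocking are made up for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits at once with the given status and return its
 * pid.  Until the parent waits for it, the exited child is a zombie. */
pid_t spawn_short_lived(int exit_status)
{
    pid_t pid = fork();
    if (pid == 0)
        _exit(exit_status);     /* child: exit immediately */
    return pid;                 /* parent: child's pid, or -1 on failure */
}

/* Non-blocking reap: collect every child that has already exited and
 * return how many were collected.  WNOHANG makes waitpid() return at
 * once instead of sleeping when no exited child is available. */
int reap_nonblocking(void)
{
    int reaped = 0;
    int status;

    while (waitpid(-1, &status, WNOHANG) > 0)
        reaped++;
    return reaped;
}

/* Blocking reap: sleep until one specific child has exited and return
 * its exit status, or -1 if it is not our child or did not exit
 * normally. */
int reap_blocking(pid_t pid)
{
    int status;

    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

In both variants the kernel only hands back children it considers to belong to the caller, which is why a zombie that is not properly linked to its parent would stay invisible whichever flags are used.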



... or perhaps
the process is not yet linked to init, so can't be waited upon.   It
needs to be on init's child queue for wait to find it, regardless of
what the ppid has been set to.


Well, for the previous occurrence, I waited many hours, and the zombie 
was still there.  (It might even have been as much as a couple of days.) 
In today's event, the 'shutdown' transition was run less than one hour 
after the first notice, and at _that_ time the zombie was reaped.  It 
doesn't seem logical that the ppid gets set, but it gets enqueued only 
after starting a shutdown.



I think I'd be checking out the sequence in the sys_exit() code, to see
if there's anything that happens, or could happen, between setting
the ppid to 1 and linking the process onto process 1's child list that
could perhaps block and cause the zombie to just sit there (for this,
once the process status is Z, you can't really trust some of the other
ps output; pid and ppid should be correct, but wchan is unlikely to have
any meaning).


Yeah, I'll have a look at the sys_exit() code and see what I can find. 
If I get really brave, I might even use gdb to attach to init(8) and see 
which of the several waitpid() calls is active.  (I'd prefer to do this 
in a qemu VM, but then I'd need to reproduce the entire environment 
inside the VM.)






Re: kqueue: SIGIO?

2015-09-30 Thread Mouse
> On the other hand, if kernel changes would be needed (for example to
> make SIGIO work with kqueue() on NetBSD) then we really should
> evaluate whether or not there is a better change that could be made
> to handle the situation, rather than just blindly making NetBSD the
> same as linux.   What that might be though I have no idea.

The first thing that comes to mind is a syscall that tells the kernel
to deliver signals, or at least certain signals, by changing a memory
location rather than arranging to execute code.  (I have trouble
imagining an architecture on which checking a volatile int variable is
more expensive than a syscall into the kernel.)

It is true, though, that that's more-than-zero cost in the loop.  But
it might be close enough to zero to be acceptable.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML		mo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: kqueue: SIGIO?

2015-09-30 Thread Thor Lancelot Simon
On Wed, Sep 30, 2015 at 12:30:36AM +0200, Joerg Sonnenberger wrote:
> On Tue, Sep 29, 2015 at 10:09:51PM +0200, Rhialto wrote:
> > On Tue 29 Sep 2015 at 13:22:08 +0200, Tobias Nygren wrote:
> > > Here is the relevant bit of the talk if you are curious:
> > > 
> > > https://www.youtube.com/watch?v=t400SmZlnO8=youtu.be=1888
> > 
> > So he wants a signal when a message is available in a kqueue, in other
> > words, can be read with kevent(2).
> 
> Why oh why. I thought the X server finally got rid of the
> overcomplicated signal handlers. If there is any kind of load going on,
> the signal sending is more costly than occasionally querying the kqueue
> for (other active) entries. If there is no load, it doesn't make a
> difference.

What he said.  I've owned some fairly performance-critical single threaded
event driven code in my time, and it is my opinion that trying to use
signals to achieve client fairness is almost always a huge mistake.

The trick is almost always to structure the event loop so that checking
for work from another client is nearly zero-cost, and scales much less
than linearly with the number of clients.  Given the very high cost of
handling a signal, it is pretty darned hard to do worse.

The way to get yourself in trouble is to chase false "optimizations"
involving processing-to-completion of too much work from a single client
at once, or shortcuts involving handing off data directly from one
client to another.  The latter, at least, are really classic priority
inversion bugs in disguise.
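
The point about not processing too much work from one client at once can be sketched like this. It is a simplified illustration under stated assumptions, not how the X server is written: struct client, service_pass and the per-pass budget are hypothetical names, and the scan shown here is a plain linear pass, so it illustrates only the bounded-work part and not the cheaper-than-linear check for pending work.

```c
#include <stddef.h>

/* Hypothetical per-client bookkeeping for the illustration. */
struct client {
    unsigned pending;   /* requests queued and not yet handled */
    unsigned served;    /* requests handled so far */
};

/* One pass of the event loop: handle at most `budget` requests from each
 * client before moving on, so that a busy client cannot starve the
 * others.  Returns the total number of requests handled in this pass. */
unsigned service_pass(struct client *clients, size_t n, unsigned budget)
{
    unsigned handled = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned take = clients[i].pending < budget
                      ? clients[i].pending : budget;

        clients[i].pending -= take;   /* the real work would happen here */
        clients[i].served += take;
        handled += take;
    }
    return handled;
}
```

With a budget of 4, a client with ten queued requests gets four of them handled per pass, and a client with a single request is still served in the very first pass.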

In practice, the X server has a shared memory transport to most clients
and a shared memory interface to the display hardware; it should seldom
have syscalls to do.  Arranging for nearly zero-cost "look aside" at
some other *properly designed and structured* shared memory source of
client requests should be pretty easy.  Does the problem actually have
to do with the mouse and keyboard?  Mouse's idea of having the kernel
write a flag word instead of interrupting the process seems like a 
very nice fit if so.

-- 
  Thor Lancelot Simon		t...@panix.com

  "We cannot usually in social life pursue a single value or a single moral
   aim, untroubled by the need to compromise with others."  - H.L.A. Hart


Re: Killing a zombie process?

2015-09-30 Thread Paul Goyette

On Wed, 30 Sep 2015, Brian Buhrow wrote:


Hello.  Did you mistype, did I misread or did you really mean to say
that the parent pid (ppid) is 0 on the offending zombie process?  That
could be a clue.  The ppid should be 1, not 0.  I wonder how, if that is
the case, the ppid of 0 gets assigned instead of 1?


it's a typo.  The parent is init, PPID==1

 UID   PID  PPID CPU PRI NI VSZ RSS WCHAN STAT TTY    TIME    COMMAND
   0 27237     1   0   0  0   0   0 -     Z    pts/2- 0:00.00 (sh)
            ^^^^




-thanks
-Brian

On Sep 30,  3:55pm, Paul Goyette wrote:
} Subject: Re: Killing a zombie process?
} On Wed, 30 Sep 2015, Paul Goyette wrote:
}
} >> # kill -HUP 1
} >> # ps axl | grep ' Z '
} >>   0 27237 1 0   0  0   0  0 -   Z    pts/2- 0:00.00 (sh)
} >
} > Well, it happened again!
} >
} > I rebooted earlier today, and then deinstalled and rebuilt about 40
} > packages within the pkgsrc/sysutils/mksandbox environment (all with
} > MAKE_JOBS=3 enabled).  After all packages were rebuilt, I exit from
} > the sandbox and run ./sandbox/dismount and get the error
} >
} >  umount: /sandbox/bin: Device busy
} >
} > Sure enough, there's a new Zombie process, and its parent seems to be
} > init  (PPID==0)
} >
} >  # ps axl | grep ' Z '
} >    0 23848 28120  85  0   4360164 pipe_rd R+   pts/2  0:00.00 grep  Z
} >    0 25439     1   0  0  0  0  0 -       Z    pts/2  0:00.00 (sh)
} >
} > HUPing init still doesn't help.
} >
} > So, I'm pretty sure that there's a bug somewhere, but haven't a clue
} > on where  to start looking.
}
} Interestingly, if I shut down to single-user mode, the zombie process
} gets reaped and disappears!
}
} So there must be some difference in how init(8) waits during normal
} operation and how it waits during the transition to single-user.
}

-- End of excerpt from Paul Goyette








Re: pkgsrc-2015Q3 released

2015-09-30 Thread Greg Troxel

"Thomas Mueller"  writes:

> Now that pkgsrc-wip has been moved to a git repository, how does a user who 
> already has pkgsrc-wip by cvs update?
>
> I checked the URL, http://pkgsrc.org/wip/ , and this was not discussed.
>
> Or does the user just delete or move the cvs repository and git clone, fresh 
> start?

Basically yes.  However, you may want to do a final update of the tree
from sourceforge and verify you have no uncommitted changes that you
want to keep.  (If so, you will have to manage them manually.)




Re: pkgsrc-2015Q3 released

2015-09-30 Thread Thomas Mueller
Now that pkgsrc-wip has been moved to a git repository, how does a user who 
already has pkgsrc-wip by cvs update?

I checked the URL, http://pkgsrc.org/wip/ , and this was not discussed.

Or does the user just delete or move the cvs repository and git clone, fresh 
start?

Tom



Re: kqueue: SIGIO?

2015-09-30 Thread Robert Elz
Date: Wed, 30 Sep 2015 09:45:32 -0400
From: Thor Lancelot Simon
Message-ID:  <20150930134532.ga25...@panix.com>

  | Does the problem actually have to do with the mouse and keyboard?

The server also needs to deal with (potential) network connections from
clients - most people these days might only run clients on the same system
as the server, and so can use shared mem, but not everyone is so limited
(I know I run across-net connections, even if it is just from a xen DomU
client to the X server running on the Dom0 - but I also do real over
ethernet/wireless X connections too on occasion).   Those connections
will never be the high performance kind, but nor should they be starved
by some other high performance, shared-mem using local client.

  | Mouse's idea of having the kernel
  | write a flag word instead of interrupting the process seems like a 
  | very nice fit if so.

It also fits with the only safe thing that's really possible to do in a
signal handler being to set a variable and return (or exit the process)
(ie: the main loop has to check a variable anyway, whether signal delivery
is traditional, or via Mouse's suggested mechanism).

The issue with it is how one would ever safely clear the variable again,
while avoiding race conditions - when a signal handler sets the variable,
it is all user code, and can use locking to be safe, one cannot lock out the
kernel though.   But maybe, given this is supposed to be rare, a sys call
to clear the var, after detecting it set, would be acceptable - or just
switch to a different var for subsequent notifications using the original
sys call, after which the first one is just a variable again, and can be
cleared normally (though that would require an indirect reference to
check it, and so greater cost for that.)
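
For comparison, the traditional all-userland arrangement can be sketched as below: the handler only sets a variable, and the main loop clears it with the signal blocked, which plays the role of the lock. This is a minimal sketch under assumptions, with made-up names (got_signal, install_flag_handler, consume_flag); a threaded program would use pthread_sigmask() rather than sigprocmask().

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Written only by the handler, read and cleared by the main loop. */
static volatile sig_atomic_t got_signal = 0;

/* The handler does the one safe thing: set a variable and return. */
static void on_signal(int signo)
{
    (void)signo;
    got_signal = 1;
}

/* Install the flag-setting handler; returns 0 on success, -1 on error. */
int install_flag_handler(int signo)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/* Called from the main loop: report whether the signal arrived since the
 * last call and clear the flag.  The signal is blocked around the
 * test-and-clear so the handler cannot run between the two steps. */
int consume_flag(int signo)
{
    sigset_t block, old;
    int seen;

    sigemptyset(&block);
    sigaddset(&block, signo);
    sigprocmask(SIG_BLOCK, &block, &old);
    seen = got_signal;
    got_signal = 0;
    sigprocmask(SIG_SETMASK, &old, NULL);
    return seen;
}
```

Blocking the signal is what is not available against a kernel that writes the variable directly, and the two sigprocmask() calls are themselves system calls, so this form of safe clearing is not free either.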

kre



Re: kqueue: SIGIO?

2015-09-30 Thread Joerg Sonnenberger
On Wed, Sep 30, 2015 at 07:37:10AM -0400, Mouse wrote:
> > On the other hand, if kernel changes would be needed (for example to
> > make SIGIO work with kqueue() on NetBSD) then we really should
> > evaluate whether or not there is a better change that could be made
> > to handle the situation, rather than just blindly making NetBSD the
> > same as linux.   What that might be though I have no idea.
> 
> The first thing that comes to mind is a syscall that tells the kernel
> to deliver signals, or at least certain signals, by changing a memory
> location rather than arranging to execute code.  (I have trouble
> imagining an architecture on which checking a volatile int variable is
> more expensive than a syscall into the kernel.)

Well, you can easily get that by just running a second thread that does
nothing but monitor the kqueue and deliver notification to the main
thread. That's a pretty standard design, and all the OpenGL code has likely
put at least one other thread into the X server anyway.
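
A rough sketch of that two-thread design follows. To keep it portable it is written against a pipe and poll(2), which stand in here for the kqueue descriptor and kevent(2); that substitution, and the names pending_events, watcher, start_watcher and take_pending_events, are assumptions made for the illustration only.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <poll.h>
#include <pthread.h>
#include <stdatomic.h>
#include <unistd.h>

/* Incremented by the watcher thread, checked by the main loop. */
static atomic_uint pending_events;

/* Watcher thread: sleep until the descriptor is readable, consume one
 * byte per event and bump the counter.  Exits on EOF or error. */
static void *watcher(void *arg)
{
    int fd = *(int *)arg;
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    char byte;

    for (;;) {
        if (poll(&pfd, 1, -1) < 0) {
            if (errno == EINTR)
                continue;       /* interrupted: just wait again */
            break;
        }
        if (read(fd, &byte, 1) <= 0)
            break;              /* EOF or read error */
        atomic_fetch_add(&pending_events, 1);
    }
    return NULL;
}

/* Start the watcher on *fd_storage, which must outlive the thread.
 * Returns 0 on success, like pthread_create(). */
int start_watcher(pthread_t *tid, int *fd_storage)
{
    return pthread_create(tid, NULL, watcher, fd_storage);
}

/* Main-loop side: fetch the number of events seen since the last call
 * and reset the counter to zero in one atomic step. */
unsigned take_pending_events(void)
{
    return atomic_exchange(&pending_events, 0);
}
```

Because both sides are ordinary userland threads, the counter can be read and cleared with a single atomic exchange, and the main loop's check is a plain memory access with no system call.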

Joerg


Re: kqueue: SIGIO?

2015-09-30 Thread Mouse
>> Mouse's idea of having the kernel write a flag word instead of
>> interrupting the process seems like a very nice fit if so.
> The issue with it is how one would ever safely clear the variable
> again, [...]

This is not difficult: you do it by not clearing the variable.

For the sake of argument and brevity, let us suppose a suitable type
for the variable in question is unsigned int.  Then the kernel, instead
of _setting_ the variable, can _increment_ the variable, and userland
can do something like

	volatile unsigned int sigflag;
	unsigned int chksigflag;
	unsigned int lastsigflag;

	sigflag = 0;
	lastsigflag = 0;
	handle_via_flag_variable(SIGIO,&sigflag); // flag to sigaction()?
	while (1) { // main loop
		...
		chksigflag = sigflag;
		if (chksigflag != lastsigflag) {
			lastsigflag = chksigflag;
			...handle the signal...
		}
		...
	}

Obviously, there is still a potential issue; if the kernel delivers
exactly k*2^32 (for integer k, assuming 32-bit unsigned ints) signals,
between userland checks, userland will miss them.  I don't consider
this a big enough risk to worry about; if it really bothers you, make
it long long int instead - there is some risk of value tearing in the
read on many architectures, but, since the kernel's increment is atomic
with respect to userland, the worst it will do is delay noticing the
signal by one trip around the loop.
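
The wraparound behaviour can be checked with a small userland simulation. The sketch below is not the proposed interface: fake_kernel_deliver merely stands in for the hypothetical kernel side, and poll_sigflag goes one step beyond the != comparison by using unsigned subtraction to count how many increments were seen; both names are invented for the example.

```c
/* Stand-in for the hypothetical kernel side: bump the counter once per
 * delivered "signal".  Unsigned arithmetic wraps around silently. */
void fake_kernel_deliver(volatile unsigned int *flag, unsigned int times)
{
    while (times-- > 0)
        (*flag)++;
}

/* Userland check for the main loop: return how many increments happened
 * since *last was recorded, and remember the new value.  The variable is
 * never cleared, so there is nothing to race with. */
unsigned int poll_sigflag(const volatile unsigned int *flag,
                          unsigned int *last)
{
    unsigned int now = *flag;           /* one read of the shared word */
    unsigned int delta = now - *last;   /* modular, so wraparound is fine */

    *last = now;
    return delta;
}
```

Starting the counter two below its maximum value and delivering three signals leaves it at 1, and the subtraction still reports 3. Only an exact multiple of 2^32 increments between two checks would go unnoticed, as described above.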

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML		mo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B