Re: (toughed out) Re: reboot/halt/shutdown does nothing

2014-02-18 Thread Marc Auslander
At one point you reported that reboot did nothing.
Was that reboot -f or just reboot? According to the man page, plain reboot
calls shutdown unless you're already at runlevel 0 or 6.





(toughed out) Re: reboot/halt/shutdown does nothing

2014-02-17 Thread Zhang Weiwu



On Tue, 18 Feb 2014, Zhang Weiwu wrote:


I have excluded another possibility.

I am thinking:

1) perhaps the message in /var/log/messages is not produced by init,
but by reboot/halt/shutdown, and

2) perhaps init is not invoked at all.

So I ran 'init 6' as root. This time, there was no new message in 
/var/log/messages, proving 1), and 'init 6' did absolutely nothing, 
disproving 2).


I was wrong. init 6 behaves differently from reboot/halt/shutdown. It did 
shut down a lot of services; my last post was sent a few seconds too early.


Among the services that 'init 6' shut down (and that reboot/halt/shutdown 
failed to stop) are:


- (in /etc/rc6.d) apache2
- (in /etc/rc6.d) mysql
- (in /etc/rc6.d) exim4

The services 'init 6' did NOT shut down are:

- portmap (manually running '/etc/rc6.d/K06portmap stop' worked)
- networking (because I can still establish new ssh connections to this server)
- rsyslogd

I have:

$ ls /etc/rc6.d/
K01apache2                K02mysql         K06portmap     K10lvm2
K01atd                    K03sendsigs      K07hwclock.sh  K11umountroot
K01exim4                  K04rsyslog       K07networking  K12reboot
K01urandom                K05umountnfs.sh  K08ifupdown    README
K01xe-linux-distribution  K06nfs-common    K09umountfs

My suspicion is that K03sendsigs hangs, because K02* was terminated while 
K04* and K06portmap weren't, and K03sendsigs is in between. ps(1) shows 
sendsigs still running:


$ ps ax | grep init
    1 ?        Ss     0:39 init [6]
19299 ?        Ss     0:00 /bin/sh /etc/init.d/rc 6
19401 ?        S      0:00 /bin/sh /etc/init.d/sendsigs stop
23319 pts/9    S+     0:00 grep init

So the task is to figure out what sendsigs does and why it hangs.
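
The bracketing argument above (everything ordered before K03sendsigs was stopped, nothing ordered after it was) can be sketched in a few lines. This is only an illustration: it assumes the K scripts run one at a time in sequence-number order, which may not hold if parallel shutdown is enabled, and the helper names stop_order and bracket are made up for this example.

```python
import re

def stop_order(names):
    """Sort K* entries by sequence number, then by name; ignore anything else."""
    scripts = []
    for name in names:
        m = re.match(r"K(\d\d)(.+)$", name)
        if m:
            scripts.append((int(m.group(1)), m.group(2), name))
    return [name for _, _, name in sorted(scripts)]

def bracket(names, suspect):
    """Return (scripts ordered before suspect, scripts ordered after it)."""
    order = stop_order(names)
    i = order.index(suspect)
    return order[:i], order[i + 1:]

# The entries of /etc/rc6.d as listed in this post.
RC6 = """K01apache2 K01atd K01exim4 K01urandom K01xe-linux-distribution
K02mysql K03sendsigs K04rsyslog K05umountnfs.sh K06nfs-common K06portmap
K07hwclock.sh K07networking K08ifupdown K09umountfs K10lvm2 K11umountroot
K12reboot README""".split()

before, after = bracket(RC6, "K03sendsigs")
print("expected to be stopped already:", before)
print("never reached if sendsigs hangs:", after)
```

With the listing from this post, apache2, mysql and exim4 land in the first group and rsyslog, portmap and networking in the second, which matches what was observed.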

There is no manual, so I went the hard way and read its source: it does the 
"Asking all remaining processes to terminate" thing.


So I suppose some daemon refuses to die, and sendsigs is either still waiting 
for it or failed to kill it forcibly and is thus confused. I looked at /var/run:


$ ls -F /var/run/
apache2/      ldapi@           portmap.pid    screen/        sshd.pid
crond.pid     motd             portmap.state  slapd/         utmp
crond.reboot  mysqld/          rpc.statd.pid  sm-notify.pid  xe-daemon.pid
exim4/        portmap_mapping  rsyslogd.pid   sshd/

portmap was stopped manually, yet portmap.pid is still there. So daemons 
don't always remove their pid files before they exit, and the files remaining 
in /var/run do not necessarily indicate daemons that refuse to die.
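
Since a leftover pid file proves nothing by itself, one way to tell stale files from live daemons is to probe each recorded pid with signal 0. The following is a minimal sketch, not something sendsigs does; pidfile_status and scan are names invented here, and it only looks at top-level *.pid files, so pid files kept in subdirectories such as mysqld/ are not covered.

```python
import os

def pidfile_status(path):
    """Classify a pid file as 'running', 'stale' or 'unreadable'."""
    try:
        with open(path) as f:
            pid = int(f.read().split()[0])
    except (OSError, ValueError, IndexError):
        return "unreadable"
    if pid <= 0:
        return "unreadable"
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, nothing is delivered
    except ProcessLookupError:
        return "stale"   # no such process: the daemon left its file behind
    except PermissionError:
        return "running" # the process exists but belongs to another user
    return "running"

def scan(run_dir="/var/run"):
    """Return {file name: status} for every *.pid file directly in run_dir."""
    try:
        names = sorted(os.listdir(run_dir))
    except OSError:
        return {}
    return {n: pidfile_status(os.path.join(run_dir, n))
            for n in names if n.endswith(".pid")}

if __name__ == "__main__":
    for name, status in scan().items():
        print(name, status)
```

A pid can be reused by an unrelated process, so "running" here means only that some process has that number.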


Did sendsigs emit any error message? There were none in /var/log/syslog or 
/var/log/messages. Another user reported seeing an error from sendsigs on 
screen while being unable to find it in either log file, so it is not logged 
there.

I am operating a remote server, so there is no screen for me to see.

His problem may be the same as mine. When he solved it, he posted:
http://forums.debian.net/viewtopic.php?f=5&t=63896

"A check forced of filesystem solved the problem."

I meditated for a while on this "check forced of filesystem"; the grammar 
isn't correct and the whole sentence makes no sense. Does he mean reboot -f 
to force a reboot? I have tried that and it made no difference compared to 
reboot without -f. Does he mean manually unmounting all non-root filesystems? 
My /var/local is the only non-root physical filesystem, and it is in use. 
'sudo lsof /var/local' hung for 1 hour, so it remains a mystery which 
process is using it, but accessing its files is fine and error-free. Besides, 
there are various *umount* scripts in /etc/rc6.d/ and they are all ordered 
after sendsigs, so they are not supposed to cause problems until sendsigs 
finishes.
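
When lsof hangs like this, a cruder alternative is to read the symlinks under /proc directly and see which ones point below the mount. The sketch below is an untested idea rather than a proven fix: it assumes that reading a /proc symlink does not itself touch the stuck filesystem, it needs root to see other users' processes, and it misses files that are only memory-mapped. The names is_under and holders are hypothetical.

```python
import os

def is_under(path, mount):
    """True if path is the mount point itself or lies below it."""
    mount = mount.rstrip("/")
    return path == mount or path.startswith(mount + "/")

def holders(mount, proc="/proc"):
    """Map pid -> sorted list of link targets (cwd, root, exe, fds) under mount."""
    result = {}
    try:
        entries = os.listdir(proc)
    except OSError:
        return result
    for entry in entries:
        if not entry.isdigit():
            continue
        base = os.path.join(proc, entry)
        links = [os.path.join(base, n) for n in ("cwd", "root", "exe")]
        fd_dir = os.path.join(base, "fd")
        try:
            links += [os.path.join(fd_dir, n) for n in os.listdir(fd_dir)]
        except OSError:
            pass  # no permission, or the process is already gone
        hits = set()
        for link in links:
            try:
                target = os.readlink(link)
            except OSError:
                continue
            if is_under(target, mount):
                hits.add(target)
        if hits:
            result[int(entry)] = sorted(hits)
    return result

if __name__ == "__main__":
    for pid, paths in sorted(holders("/var/local").items()):
        print(pid, *paths)
```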


So a dead end again. Then, browsing through the process tree, I found one 
process that was started 2 weeks ago and should be long dead:


$ ps ax | grep youtu
18380 ?        D      0:03 python /usr/local/bin/youtube-dl

I vaguely remember it had been run on an NFS mount which was jammed, and 
later, because a normal umount was not possible (NFS server gone), I had done 
a lazy umount:

# umount -l /mnt/nfs

So I believe this one is the culprit. kill -9 cannot kill it, confirming my 
guess. https://wiki.debian.org/Kill says if you can't kill a process with 
kill -9, you should reboot, which brings me back to this very problem: chicken 
or egg first?


With no way to kill 18380 but to reboot, and no way to reboot but to kill 
18380, I instead killed sendsigs with -TERM. The result was trouble: I was 
immediately kicked out of my ssh session, the server stopped responding to 
ping, and half an hour later I capitulated and called the datacenter for a 
cold reboot.


After the server was online again, I immediately did a reboot and it 
succeeded. So it is very likely that the stalled process 18380 was what 
blocked reboot/shutdown.


My conclusion so far:

1. If you have an NFS mount and the NFS server is gone, you cannot umount it 
unless you reboot, and that reboot will not succeed either, so you need a 
cold reboot.


2. You can get an NFS mount out of sight with a lazy umount (umount -l), but 
it is still there, holding any process that uses it. I waited 2