[osol-help] Total system hang caused by makedb in NIS Makefile?

Thor Simon Fri, 22 Sep 2006 22:28:05 -0700

I'm running build 44 on a dual dual-core Opteron.  Root, /usr, and swap are on 
SVM mirrors on a pair of SATA disks on a SiI 3114; the machine serves about a 
hundred NFS and SMB clients from two large UFS filesystems on Apple Xraids 
connected by LSI FibreChannel cards; it is also our master NIS server.


Recently the system has begun to misbehave in a truly spectacular and bizarre 
way when I run 'make' in /var/yp.  Our only modification to /var/yp/Makefile 
has been to use /etc/nis instead of /etc as the source directory.  We see the 
following misbehavior whether or not the server system itself is using NIS for 
_anything_ in nsswitch.conf, and even if we ypstop before running the make:

Within a second or so of our running make, if passwd needs to be rebuilt, make 
completes processing for passwd.byname and begins processing passwd.byuid, 
using nawk to merge passwd and shadow and feeding the result to makedbm, just 
as it did for passwd.byname.
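
For anyone who wants to reproduce the byuid input outside of make, the merge step described above can be approximated as follows. This is only a sketch, not the rule from /var/yp/Makefile: the function names are made up for illustration, and it assumes the rule replaces the second passwd field with the hash from the matching shadow entry and emits one "key, TAB, value" line per user, keyed by uid.

```python
def merge_passwd_shadow(passwd_lines, shadow_lines):
    """Return (uid, record) pairs with the shadow hash spliced into field 2."""
    # Map login name -> password hash from shadow-format lines.
    hashes = {}
    for line in shadow_lines:
        fields = line.rstrip("\n").split(":")
        if len(fields) >= 2:
            hashes[fields[0]] = fields[1]

    records = []
    for line in passwd_lines:
        fields = line.rstrip("\n").split(":")
        # Skip blank lines, comments, and anything without a uid field.
        if len(fields) < 3 or not fields[0] or fields[0].startswith("#"):
            continue
        # Keep the original field if shadow has no entry for this login.
        fields[1] = hashes.get(fields[0], fields[1])
        records.append((fields[2], ":".join(fields)))
    return records


def to_makedbm_input(records):
    """Format pairs as 'key<TAB>value' lines, one per record."""
    return "".join(f"{key}\t{value}\n" for key, value in records)
```

Writing the result of to_makedbm_input() to a file gives something that can be inspected, or handed to makedbm by hand, without going through the Makefile at all.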

But this time, all other processes on the system except makedbm quickly grind 
to an (unkillable!) halt.  We suspect they stop as soon as they try to do any 
disk I/O, but we are not sure.  According to truss, makedbm itself seems to 
write to the new dbm file a few times a second.  A copy of top we left running 
shows it consuming about 25% of one CPU; once this starts we can't run any new 
diagnostic tools, because the shell hangs!

If left alone this will continue for _hours_: makedbm slooooowly writes the 
byuid DBM files, and nothing else on the system does any work, seemingly 
because it can't do any I/O.

We have about 4000 users in our passwd file.  I know there's an issue with keys 
in the DBM files used by NIS exceeding the 1024-byte limit embedded in ndbm 
itself but that does not seem to be what is happening here; makedbm doesn't 
fail, instead the whole system grinds to a halt.  I can reproduce this with our 
password file truncated to as few as 250 users, though oddly enough it does not 
happen if I chop the file at 225 users and the syntax of the passwd and shadow 
records for lines 225-250 is OK.
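
A quick way to rule the size limit and malformed records in or out is to scan the file directly. The sketch below is an illustration, not an existing tool: find_suspect_lines and RECORD_LIMIT are hypothetical names, the 1024-byte figure is the one mentioned above, and it is assumed here to apply to key and value combined. It checks raw passwd lines only, so a record will be somewhat larger once the shadow hash is merged in.

```python
RECORD_LIMIT = 1024  # bytes; assumed to cover key plus value


def find_suspect_lines(passwd_lines, limit=RECORD_LIMIT):
    """Return (line_number, problem) pairs for lines worth a closer look."""
    problems = []
    for lineno, line in enumerate(passwd_lines, start=1):
        record = line.rstrip("\n")
        if not record:
            problems.append((lineno, "blank line"))
            continue
        fields = record.split(":")
        if len(fields) != 7:
            problems.append((lineno, f"{len(fields)} fields, expected 7"))
            continue
        if not fields[2].isdigit():
            problems.append((lineno, "uid is not numeric"))
        # The key is the login name (byname) or the uid (byuid);
        # use whichever is longer as the worst case.
        key_len = max(len(fields[0].encode()), len(fields[2].encode()))
        size = key_len + len(record.encode())
        if size > limit:
            problems.append((lineno, f"record is {size} bytes, over {limit}"))
    return problems
```

Running it over the slice of the file between the last good and first bad truncation points (lines 225-250 here) narrows down whether any single record is unusual.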

Has anyone ever seen anything like this before?  I suspect a volume manager bug 
-- it's all I can think of, honestly -- since we have root, usr, and swap on 
mirrored volumes -- but I cannot imagine what could suddenly be triggering it.
 
 
This message posted from opensolaris.org
_______________________________________________
opensolaris-help mailing list
opensolaris-help@opensolaris.org
