[ I'm reposting this because Jeff's mail is still getting marked ]
[ as spam by the Gna! mail server.  -Chris ]
Date: Wed, 20 Dec 2006 21:00:58 -0500
From: Jeff Squyres <[EMAIL PROTECTED]>
To: Chris Dunlap <[EMAIL PROTECTED]>
cc: [email protected]
Subject: Re: [munge-users] munged running at 100%

On Dec 20, 2006, at 3:15 PM, Chris Dunlap wrote:

> I'll release these changes in 0.5.7 either today or tomorrow.

Awesome.

> Out of curiosity, how long does it take your back-end node to compute
> the gids map now that I'm caching the uid lookups?

I'm not sure what you're asking me to time here -- the execution of the
_gids_hash_create() function?  I don't see caching going on in there, so
I suspect you mean something else...

FWIW, I found out that my cluster installer software (OSCAR) was
overwriting my /etc/group and /etc/passwd files on all my back-end nodes
every 15 minutes.  And it was overwriting them with the output from
"getent group" / "getent passwd" on my cluster head node.  This resulted
in group/passwd files that contained the original skeleton group/passwd
entries *and* all the entries from NIS.  getgrent(), therefore, saw the
entire group/passwd files *and* all the NIS entries.  That is, it
essentially saw every entry from NIS twice (once in the file, and once
in NIS).  Yoinks.

I performed the following test to check the performance differences:

0. Disable OSCAR's refreshing of /etc/group and /etc/passwd.

1. Write a short C program that essentially duplicates the getgrent()
   loop in gids.c and also shows timing information.

2. With a "full" /etc/passwd and /etc/group on a back-end compute node
   (i.e., what OSCAR put there as a result of getent(1)), run the test
   program.  It completed in 606 seconds.

3. Replace /etc/passwd and /etc/group with the skeletal versions that
   they are supposed to be (i.e., no duplicated entries from NIS) and
   run the test program.  It completed in 17 seconds.
Additionally, running munged with all the default settings (i.e.,
letting it generate the gid map) when /etc/passwd and /etc/group had
only the skeletal entries showed significant CPU load for only the
first 15-20 seconds of its execution, after which the CPU load dropped
to 0.  This presumably jibes with #3; the initial munged load is from
building the initial gid map.  I did not re-run munged with the "full"
versions of passwd/group, but I assume that, per #2, I'd see heavy load
from munged for about 600 seconds.  If OSCAR refreshed the /etc/group
file, munged's timer would expire a short time later and it would
simply turn around and re-create the gid map again.  Otherwise, this
initial heavy load would be a one-time expense and munged's CPU load
would go to 0.

I conclude that having these "wrong" passwd/group files does two things:

1. Makes the getgrent() process take much longer.
2. Makes the getgrent() process much more CPU-intensive.

#1 is possibly a result of #2, but there are probably other factors
involved (perhaps when the majority of the entries are in NIS, most of
the work is blocking on network access, not local CPU processing,
etc...?).

So -- in short -- I think this whole hullabaloo was due to OSCAR
inadvertently doing the wrong thing in an NIS environment.  Not munge's
fault at all.  I'll be mailing the OSCAR list with details shortly...

> -Chris
>
>
> On Wed, 12/20/06 07:55a EST, Jeff Squyres wrote:
>>
>> Chris --
>>
>> This worked perfectly, thanks!
>>
>> I added OPTIONS="-f --group-update-time=-1" to my sysconfig/munge
>> file, and it works like a champ.
>>
>> Will you do a 0.5.7 release with this new stuff?  I'm perfectly happy
>> to continue using a snapshot; whatever is easiest for you.
>>
>> I also did a little more poking around w.r.t. NIS (I am a NIS newbie
>> -- forgive me; there's probably some fairly obvious controls for this
>> stuff somewhere that I'm unaware of).
>> I found that my /etc/group file *is* changing, so munge was doing
>> exactly the Right Thing in re-creating the hash map.  Specifically,
>> NIS seems to be updating my /etc/group file to be the NIS group file
>> every so often.  I don't know where/when this is happening yet, but
>> I'll be tracking it down.
>>
>> Again, I want to stress that I think the majority of these issues are
>> problems with my local setup (the fact that I'm an NIS newbie is
>> probably contributing to the problems...), but I deeply appreciate
>> the workarounds that I now have in munge.  Thanks!
>>
>> If there's any further testing that you'd like in an NIS environment,
>> feel free to ask.

--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems

_______________________________________________
munge-users mailing list
[email protected]
https://mail.gna.org/listinfo/munge-users
