[ I'm reposting this because Jeff's mail is still getting marked ]
[ as spam by the Gna! mail server.  -Chris ]
Date: Wed, 20 Dec 2006 21:00:58 -0500
From: Jeff Squyres <[EMAIL PROTECTED]>
To: Chris Dunlap <[EMAIL PROTECTED]>
cc: [email protected]
Subject: Re: [munge-users] munged running at 100%

On Dec 20, 2006, at 3:15 PM, Chris Dunlap wrote:

> I'll release these changes in 0.5.7 either today or tomorrow.

Awesome.

> Out of curiosity, how long does it take your back-end node to compute
> the gids map now that I'm caching the uid lookups?

I'm not sure what you're asking me to time here -- the execution of the
_gids_hash_create() function?  I don't see caching going on in there, so
I suspect you mean something else...

FWIW, I found out that my cluster installer software (OSCAR) was
overwriting my /etc/group and /etc/passwd files on all my back-end nodes
every 15 minutes.  And it was overwriting them with the output from
"getent group" / "getent passwd" on my cluster head node.  This resulted
in group/passwd files that contained the original skeleton group/passwd
entries *and* all the entries from NIS.  getgrent(), therefore, saw the
entire group/passwd files *and* all the NIS entries.  That is, it
essentially saw every entry from NIS twice (once in the file, and once
in NIS).  Yoinks.

I performed the following test to check the performance differences:

0. Disable OSCAR's refreshing of /etc/group and /etc/passwd.

1. Write a short C program that essentially duplicates the getgrent()
   loop in gids.c and also shows timing information.

2. With a "full" /etc/passwd and /etc/group on a back-end compute node
   (i.e., what OSCAR put there as a result of getent(1)), run the test
   program.  It completed in 606 seconds.

3. Replace /etc/passwd and /etc/group with the skeletal versions that
   they are supposed to be (i.e., no duplicated entries from NIS) and
   run the test program.  It completed in 17 seconds.
Additionally, running munged with all the default settings (i.e.,
letting it generate the gid map) when /etc/passwd and /etc/group had
only the skeletal entries showed significant CPU load for only the
first 15-20 seconds of its execution, after which the CPU load dropped
to 0.  This presumably jibes with #3; the initial munged load is from
building the initial gid map.  I did not re-run munged with the "full"
versions of passwd/group, but I assume that, per #2, I'd see heavy load
from munged for about 600 seconds.  If OSCAR refreshed the /etc/group
file, munged's timer would expire a short time later and it would
simply turn around and re-create the gid map again.  Otherwise, this
initial heavy load would be a one-time expense and munged's CPU load
would go to 0.

I conclude that having these "wrong" passwd/group files does two things:

1. Makes the getgrent() process take much longer.
2. Makes the getgrent() process much more CPU-intensive.

#1 is possibly a result of #2, but there are probably other factors
involved (perhaps when the majority of the entries are in NIS, most of
the work is blocking on network access, not local CPU processing,
etc...?).

So -- in short -- I think this whole hullabaloo was due to OSCAR
inadvertently doing the wrong thing in an NIS environment.  Not munge's
fault at all.  I'll be mailing the OSCAR list with details shortly...

> -Chris
>
>
> On Wed, 12/20/06 07:55a EST, Jeff Squyres wrote:
>>
>> Chris --
>>
>> This worked perfectly, thanks!
>>
>> I added OPTIONS="-f --group-update-time=-1" to my sysconfig/munge
>> file, and it works like a champ.
>>
>> Will you do a 0.5.7 release with this new stuff?  I'm perfectly happy
>> to continue using a snapshot; whatever is easiest for you.
>>
>> I also did a little more poking around w.r.t. NIS (I am a NIS newbie
>> -- forgive me; there's probably some fairly obvious controls for this
>> stuff somewhere that I'm unaware of).
>> I found that my /etc/group file *is* changing, so munge was doing
>> exactly the Right Thing in re-creating the hash map.  Specifically,
>> NIS seems to be updating my /etc/group file to be the NIS group file
>> every so often.  I don't know where/when this is happening yet, but
>> I'll be tracking it down.
>>
>> Again, I want to stress that I think the majority of these issues are
>> problems with my local setup (the fact that I'm an NIS newbie is
>> probably contributing to the problems...), but I deeply appreciate
>> the workarounds that I now have in munge.  Thanks!
>>
>> If there's any further testing that you'd like in an NIS environment,
>> feel free to ask.

--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems

_______________________________________________
munge-users mailing list
[email protected]
https://mail.gna.org/listinfo/munge-users
