Hi Howard,

>> At a guess, based on the minimal amount of information here, you've run into
>> the glibc malloc fragmentation issue,
>> and switching to tcmalloc might avoid the problem.

What's the quickest way to validate this on the slapd that is running at 99%,
prior to falling back on tcmalloc? Can the process's smaps reveal this, for
example if we're seeing a large number of 64MB regions?

Thanks
++Cyrille

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Howard Chu
Sent: Friday, March 16, 2012 8:32 AM
To: Jeffrey Crawford
Cc: OpenLDAP technical list
Subject: Re: OpenLDAP high CPU usage when performing mass changes

Jeffrey Crawford wrote:
> We are using openldap 2.4.26 with BDB 4.8 and have replication set up
> in mirror mode for our main ldap database. There are a couple of other
> replicas that have a subset of the data that the main cluster has but
> we are seeing the following behavior on all of them.
>
> When performing mass updates via LDAP, let's say on the order of 30,000
> entries being added to existing entries, we've noticed that the CPU
> use of the slapd instances goes through the roof (between 65% and 95%
> continuously), and seems to stay there until it is restarted.

When the CPU usage goes high like that it should be pretty easy to see where
it's going, by getting a gdb stack trace of the running process.

At a guess, based on the minimal amount of information here, you've run into
the glibc malloc fragmentation issue, and switching to tcmalloc might avoid
the problem.

> The problem is that this system has to be highly available, even for
> writing, and when these updates "shock" the system, the response time
> goes way down when the processes are turning like that. I don't think
> they are trying to catch up to the data changes, because if I let them
> run a while after the updates are done (talking like 1hr) and then
> restart the instances, they go back to their normal state.

If you have the SYNC loglevel enabled, it should be obvious whether update
traffic is the cause or not.

> So far the only way I've been able to mitigate the issues is to
> reconfigure our ldap proxy instances to a machine that is having less
> trouble, restart the instances that are chugging along, then repoint
> the proxies back to the one just started, and start the others. Not exactly a
> quick operation.
>
> I've played with cache settings for both OpenLDAP and BDB and have
> gotten the frequency of this issue reduced but I can't seem to get rid
> of it completely and it shows up quite often after large data
> manipulations. I'm at a loss as to how to debug since nothing is
> crashing. Any suggestions on how to find out what's causing this would
> be very helpful. The logs are not throwing any warnings or posting
> messages that would seem out of the ordinary and I have played with
> the log settings but nothing seems to relate to anything that might explain
> why we are seeing CPU usage go so high.

I would suggest you try out back-mdb in RE24. MDB uses 1/4 the total memory
of BDB and it performs far fewer mallocs, so glibc malloc fragmentation
should not be a problem.

(I would have suggested 2.4.30, but the ITS#7190 fix is rather important if
you have large volumes of delete operations. The other MDB-related ITSs,
#7191 and #7196, are only crucial for non-X86 and non-Linux platforms.)

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/
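
On the smaps question at the top of this thread, here is a rough sketch of how
one might count 64MB anonymous regions in a running slapd. It is a heuristic,
not a diagnosis. It assumes that glibc reserves each non-main malloc arena heap
as a 64MB-aligned, 64MB-long anonymous region on 64-bit Linux, often split into
a committed rw-p part and a reserved ---p remainder. The function names and the
alignment test are illustrative and not taken from any OpenLDAP or glibc tool.
A large count only suggests that many arenas are in use; the gdb stack trace
suggested above remains the direct way to see where the CPU time goes.

```python
import re
import sys

# Header line of an smaps entry: "start-end perms offset dev inode [path]"
MAPPING_RE = re.compile(
    r"^([0-9a-f]+)-([0-9a-f]+) (\S{4}) \S+ \S+ (\d+)\s*(.*)$")

# Assumption: size and alignment of a glibc arena heap on 64-bit Linux.
ARENA_HEAP_BYTES = 64 * 1024 * 1024


def parse_smaps(text):
    """Parse smaps text into a list of mappings with their Rss in kB."""
    mappings = []
    for line in text.splitlines():
        m = MAPPING_RE.match(line)
        if m:
            mappings.append({
                "start": int(m.group(1), 16),
                "end": int(m.group(2), 16),
                "perms": m.group(3),
                "path": m.group(5).strip(),
                "rss_kb": 0,
            })
        elif line.startswith("Rss:") and mappings:
            mappings[-1]["rss_kb"] = int(line.split()[1])
    return mappings


def count_arena_like_regions(mappings):
    """Count 64MB-aligned anonymous spans that are exactly 64MB long.

    Returns (count, total_rss_kb) over the spans that match.
    """
    count = 0
    total_rss_kb = 0
    for i, first in enumerate(mappings):
        if first["path"] or first["start"] % ARENA_HEAP_BYTES != 0:
            continue
        target = first["start"] + ARENA_HEAP_BYTES
        end = first["end"]
        rss_kb = first["rss_kb"]
        j = i + 1
        # Extend across contiguous anonymous mappings (rw-p then ---p).
        while (end < target and j < len(mappings)
               and not mappings[j]["path"]
               and mappings[j]["start"] == end):
            end = mappings[j]["end"]
            rss_kb += mappings[j]["rss_kb"]
            j += 1
        if end == target:
            count += 1
            total_rss_kb += rss_kb
    return count, total_rss_kb


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python smaps_arenas.py /proc/<pid>/smaps
    with open(sys.argv[1]) as f:
        n, rss = count_arena_like_regions(parse_smaps(f.read()))
    print("arena-like 64MB regions:", n, "resident kB:", rss)
```

Run it as a user allowed to read /proc/<pid>/smaps for the slapd process.
Comparing the count and resident size before and after a mass update may give
a hint, but treat the numbers as suggestive only.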
