What does the test filter look like? Can you share a sanitized sample of the access log showing the SRCH and its matching RESULT line?
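For reference, here is one way to pull the slow searches out of the access log. This is only a sketch: the sample log lines, the /tmp path, and the 5-second etime threshold are illustrative assumptions, not your real data (the real log is typically under /var/log/dirsrv/slapd-<instance>/access).

```shell
# Hypothetical sanitized access log sample (illustrative only).
cat > /tmp/access.sample <<'EOF'
[15/Nov/2016:11:40:01 -0800] conn=1234 op=5 SRCH base="dc=example,dc=com" scope=2 filter="(uid=healthcheck)" attrs=ALL
[15/Nov/2016:11:40:08 -0800] conn=1234 op=5 RESULT err=0 tag=101 nentries=1 etime=7
[15/Nov/2016:11:40:09 -0800] conn=1235 op=2 SRCH base="dc=example,dc=com" scope=2 filter="(uid=fast)" attrs=ALL
[15/Nov/2016:11:40:09 -0800] conn=1235 op=2 RESULT err=0 tag=101 nentries=1 etime=0
EOF

# Find conn/op pairs of RESULT lines whose etime is 5s or more,
# then print the matching SRCH line so the slow filter is visible.
awk '/ RESULT / { for (i=1;i<=NF;i++) if ($i ~ /^etime=/) { split($i,a,"="); if (a[2]+0 >= 5) print $3, $4 } }' /tmp/access.sample |
while read conn op; do
  grep -F "$conn $op SRCH" /tmp/access.sample
done
```

Against the sample above this prints only the SRCH line for the 7-second search, which is the kind of sanitized pair that would help here.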
If using SSL, review the output of:

  cat /proc/sys/kernel/random/entropy_avail

Do we have replication? (And large attribute values?)

You may want to run the dbmon.sh script to monitor cache usage for the db cache and entry cache. Try to capture a few samples of the lines reporting dbcachefree and userroot:ent (if userroot is the database with the problems) while the searches are becoming too long, for example:

  INCR=1 HOST=m2.example.com BINDDN="cn=directory manager" BINDPW="password" VERBOSE=2 /usr/sbin/dbmon.sh

Also review the ns-slapd errors log and the system messages log for any unusual activity.

What is the ns-slapd memory footprint from restart to slow responses? Any "too high" disk I/O? (Or a "bad" SSD?)

Thanks,
M.

On Tue, Nov 15, 2016 at 11:40 AM, Gordon Messmer <[email protected]> wrote:
> I'm trying to track down a problem we are seeing on two relatively lightly
> used instances on CentOS 7 (and previously on CentOS 6, which is no longer
> in use). Our servers have 3624 entries according to last night's export
> (we export userRoot daily). There are currently just over 400 connections
> established to each server.
>
> We have a local cron job that runs every 5 minutes that performs a simple
> query. If it takes more than 7 seconds to get an answer, the attempt is
> aborted and another query issued. If three consecutive tests fail, the
> directory server is restarted.
>
> The issue we're seeing is that the longer the system is up, the more often
> checks will fail. Restarting the directory does not resolve the problem.
> Our servers have currently been up for 108 days, and are restarting the
> service several times a day, as a result of the checks. Only if we reboot
> the systems does the problem subside.
>
> CPU utilization seems relatively high for such a small directory, but it's
> not constant. I tried to manually capture a bit of data with strace during
> a period when CPU use was bursting high.
> During a capture of maybe two seconds, I saw most of the CPU time was
> spent in futex. usecs/call was fairly high for calls to futex and select,
> as detailed below.
>
> Since restarting the service doesn't fix the problem, it seems most likely
> that this is an OS bug, but I'm hoping that the list can help me identify
> other useful data to track down the problem. Does anyone have any
> suggestions for what I can capture now, while I can sometimes observe the
> problem? If I reboot, it'll take months before I can get any new data.
>
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  74.61    4.505251        3590      1255       340 futex
>  17.65    1.065548        6660       160           select
>   4.41    0.266344       88781         3         2 restart_syscall
>   3.07    0.185566          50      3718           poll
>   0.10    0.006185           2      3610           sendto
>   0.09    0.005189        5189         1           fsync
>   0.04    0.002134          37        58           write
>   0.03    0.001618          27        61           setsockopt
>   0.00    0.000111           3        36           recvfrom
>   0.00    0.000078           1        57           read
>   0.00    0.000014          14         1           fstat
>   0.00    0.000003           2         2           accept
>   0.00    0.000003           1         6           fcntl
>   0.00    0.000002           1         2           getsockname
>   0.00    0.000001           1         2           close
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    6.038047                  8972       342 total
>
> _______________________________________________
> 389-users mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
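To answer the memory-footprint question above, it helps to sample ns-slapd's resident size periodically from restart until the checks start failing, then compare the first and last samples. A minimal sketch; the sample_rss helper is an assumption for illustration, and the demo runs against this shell's own PID (point it at ns-slapd's PID, e.g. from "pidof ns-slapd", on the real system):

```shell
# Print the resident set size (KB) of a process by PID.
sample_rss() {
  pid=$1
  ps -o rss= -p "$pid" | tr -d ' '
}

# Demo: three one-second samples of this shell's own RSS.
# In practice, run this against ns-slapd's PID with a longer
# interval and log the output to a file for later comparison.
for i in 1 2 3; do
  printf '%s %s\n' "$(date +%s)" "$(sample_rss $$)"
  sleep 1
done
```

A steadily growing RSS between restart and the slow responses would point at a leak or cache growth; a flat RSS pushes suspicion toward contention (consistent with the futex time in the strace summary) or I/O.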
