What does the test search filter look like?
Can we see a sanitized sample of the access log with the corresponding SRCH and RESULT lines?
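
For example, something along these lines should pull the relevant entries
(the instance name is a placeholder):

grep -E "SRCH|RESULT" /var/log/dirsrv/slapd-INSTANCE/access

The etime value on the matching RESULT lines shows how long each search
actually took inside the server.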

If using SSL, review the output of
cat /proc/sys/kernel/random/entropy_avail
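
To sample it over time while the problem is happening, something like:

watch -n 5 cat /proc/sys/kernel/random/entropy_avail

If the value stays very low, SSL handshakes can block waiting for entropy.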

Do we have replication? (and large attribute values?)
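
If unsure, the replication agreements can be listed with something like
this (the bind DN is a placeholder):

ldapsearch -x -D "cn=directory manager" -W -b cn=config "(objectclass=nsds5replicationagreement)"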

You may want to run the "dbmon.sh" script to monitor cache usage for the
db cache and entry cache. Try to capture a few samples of the lines about
dbcachefree and userroot:ent (if the db with the problems is userRoot)
while the searches are becoming too long, like this example:
INCR=1 HOST=m2.example.com BINDDN="cn=directory manager" BINDPW="password" \
VERBOSE=2 /usr/sbin/dbmon.sh
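
If the output is noisy, a grep keeps just the lines mentioned above
(assuming dbmon.sh prints those labels verbatim):

INCR=1 HOST=m2.example.com BINDDN="cn=directory manager" BINDPW="password" \
VERBOSE=2 /usr/sbin/dbmon.sh | grep -E "dbcachefree|userroot:ent"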

Also review the ns-slapd errors log and the system messages log for any
unusual activity.
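
On CentOS 7, something like this should cover both (the instance name is
a placeholder):

less /var/log/dirsrv/slapd-INSTANCE/errors
journalctl -u dirsrv@INSTANCE
grep ns-slapd /var/log/messages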

What is the ns-slapd memory footprint from restart until responses become slow?
Any unusually high disk I/O? (Or a failing SSD?)
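
A couple of quick ways to sample those (iostat is in the sysstat package):

ps -o pid,rss,vsz,cmd -p $(pidof ns-slapd)
iostat -x 5 3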

Thanks,
M.

On Tue, Nov 15, 2016 at 11:40 AM, Gordon Messmer <[email protected]>
wrote:

> I'm trying to track down a problem we are seeing on two relatively lightly
> used instances on CentOS 7 (and previously on CentOS 6, which is no longer
> in use).  Our servers have 3624 entries according to last night's export
> (we export userRoot daily).  There are currently just over 400 connections
> established to each server.
>
> We have a local cron job that runs every 5 minutes that performs a simple
> query.  If it takes more than 7 seconds to get an answer, the attempt is
> aborted and another query issued.  If three consecutive tests fail, the
> directory server is restarted.
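>
> A minimal sketch of that kind of check, assuming OpenLDAP's ldapsearch
> and a placeholder base DN and filter:
>
> timeout 7 ldapsearch -x -H ldap://localhost -b "dc=example,dc=com" \
>     -s base "(objectClass=*)" >/dev/null 2>&1 || echo "check failed"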
>
> The issue we're seeing is that the longer the system is up, the more often
> checks will fail.  Restarting the directory does not resolve the problem.
> Our servers have currently been up for 108 days, and are restarting the
> service several times a day, as a result of the checks.  Only if we reboot
> the systems does the problem subside.
>
> CPU utilization seems relatively high for such a small directory, but it's
> not constant.  I tried to manually capture a bit of data with strace during
> a period when CPU use was bursting high.  During a capture of maybe two
> seconds, I saw most of the CPU time was spent in futex.  usecs/call was
> fairly high for calls to futex and select, as detailed below.
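>
> For reference, a per-syscall summary like the one below can be gathered
> with something like this (attach briefly, then interrupt with Ctrl-C;
> assumes a single ns-slapd process):
>
> strace -c -f -p $(pidof ns-slapd)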
>
> Since restarting the service doesn't fix the problem, it seems most likely
> that this is an OS bug, but I'm hoping that the list can help me identify
> other useful data to track down the problem.  Does anyone have any
> suggestions for what I can capture now, while I can sometimes observe the
> problem?  If I reboot, it'll take months before I can get any new data.
>
>
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  74.61    4.505251        3590      1255       340 futex
>  17.65    1.065548        6660       160           select
>   4.41    0.266344       88781         3         2 restart_syscall
>   3.07    0.185566          50      3718           poll
>   0.10    0.006185           2      3610           sendto
>   0.09    0.005189        5189         1           fsync
>   0.04    0.002134          37        58           write
>   0.03    0.001618          27        61           setsockopt
>   0.00    0.000111           3        36           recvfrom
>   0.00    0.000078           1        57           read
>   0.00    0.000014          14         1           fstat
>   0.00    0.000003           2         2           accept
>   0.00    0.000003           1         6           fcntl
>   0.00    0.000002           1         2           getsockname
>   0.00    0.000001           1         2           close
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    6.038047                  8972       342 total