Thank you for the data; I think I now understand what is going on.

In the error logs we see only messages like these (from my test env):

[16/Jul/2015:10:12:40 +0200] NSMMReplicationPlugin - agmt="cn=100-300" (localhost:9759): replay_update: modifys operation (dn="dc=example,dc=com" csn=55a82a29000100640000) not sent - empty
[16/Jul/2015:10:12:40 +0200] NSMMReplicationPlugin - agmt="cn=100-300" (localhost:9759): replay_update: Consumer successfully sent operation with csn 55a82a29000100640000
[16/Jul/2015:10:12:40 +0200] NSMMReplicationPlugin - agmt="cn=100-300" (localhost:9759): Skipping update operation with no message_id (uniqueid 7507cb26-e8ac11e2-b2898005-8430f734, CSN 55a82a29000100640000):

This happens if fractional replication is configured, as IPA does, and the changes affect only attributes which will NOT be replicated. The local RUV is updated, but since no change is actually sent, the consumer RUV is not updated, and replication will always start from a very old starting csn. It is a rare scenario where a server receives only mods which are not replicated.
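You can check which attributes are stripped by looking at the fractional replication setting on the agreements; the EXCLUDE list shown below is only what IPA typically configures, your own output may differ:

ldapsearch -o ldif-wrap=no -x -D "cn=directory manager" -W -b "cn=config" "objectclass=nsds5replicationagreement" nsDS5ReplicatedAttributeList

nsDS5ReplicatedAttributeList: (objectclass=*) $ EXCLUDE memberof idnssoaserial entryusn krblastsuccessfulauth krblastfailedauth krbloginfailedcount

Any mod that touches only attributes in that EXCLUDE list is stripped down to an empty operation and never sent to the consumer.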

I have opened a ticket for this: https://fedorahosted.org/389/ticket/48225

As a workaround, can you try to apply a mod on m14-26 which will not be stripped: either create a dummy user or add a description attribute to an existing object. Replication will once again iterate through all the changes (which can take a while), but should then replay this latest change and define a new offset.
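For example, something like this (the target DN is just a placeholder; any entry in the replicated suffix will do):

ldapmodify -x -D "cn=directory manager" -W
dn: uid=someuser,cn=users,cn=accounts,dc=ccr,dc=buffalo,dc=edu
changetype: modify
replace: description
description: dummy change to advance the consumer RUV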

Regards,
Ludwig


On 07/15/2015 07:05 PM, Andrew E. Bruno wrote:
On Wed, Jul 15, 2015 at 04:58:23PM +0200, Ludwig Krispenz wrote:
On 07/15/2015 04:10 PM, Andrew E. Bruno wrote:
On Wed, Jul 15, 2015 at 03:22:51PM +0200, Ludwig Krispenz wrote:
On 07/14/2015 08:59 PM, Andrew E. Bruno wrote:
On Tue, Jul 14, 2015 at 04:52:10PM +0200, Ludwig Krispenz wrote:
hm, the stack traces show csn_str values which correspond to Jul 8th, Jul 4th, and
Jul 7th - so it looks like it is iterating the changelog over and over
again.
The consumer side is "cn=meTosrv-m14-24.ccr.buffalo.edu" - is this the master?

can you provide the result of the following search from
m14-24.ccr.buffalo.edu and the server with the high cpu:

ldapsearch -o ldif-wrap=no -x -D ... -w  -b "cn=config"
"objectclass=nsds5replica" nsds50ruv
The master is srv-m14-24; here are the results of the ldapsearch:

[srv-m14-24 ~]$ ldapsearch -o ldif-wrap=no -x -D "cn=directory manager" -W  -b 
"cn=config" "objectclass=nsds5replica" nsds50ruv

# replica, dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu, mapping tree, config
dn: cn=replica,cn=dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu,cn=mapping tree,cn=config
nsds50ruv: {replicageneration} 5527f711000000040000
nsds50ruv: {replica 4 ldap://srv-m14-24.ccr.buffalo.edu:389} 5527f771000000040000 55a55aed001000040000
nsds50ruv: {replica 5 ldap://srv-m14-26.ccr.buffalo.edu:389} 5537c773000000050000 5591a3d2000700050000
nsds50ruv: {replica 6 ldap://srv-m14-25-02.ccr.buffalo.edu:389} 55943dda000000060000 55945378000200060000
so this is really strange: the master m14-24 has the latest change from
replica 5 (m14-26) as 5591a3d2000700050000,
which corresponds to Mon, 29 Jun 2015 20:00:18 GMT.
So no update from m14-26 has arrived since then, or it could not update the RUV.
So m14-26 tries to replicate all the changes back from that time, but it looks
like it has no success.
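As a side note: the first 8 hex digits of a CSN are a unix timestamp, so you can decode it yourself, e.g. with GNU date:

$ date -u -d @$((16#5591a3d2))
Mon Jun 29 20:00:18 UTC 2015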
is there anything in the logs of m14-24 ? can you see successful mods with
csn=xxxxxxx00050000 ?
Here's what I could find from the logs on srv-m14-24:


[srv-m14-24 ~]# grep -r 00050000 /var/log/dirsrv/slapd-[domain]/*
access.20150714-014346:[14/Jul/2015:03:10:05 -0400] conn=748529 op=14732 RESULT 
err=0 tag=103 nentries=0 etime=1 csn=55a4b5f0005000040000
ok, so no update originating at replica 5 has been replicated (probably
since June 29). Did you experience data inconsistency between the servers?

And here's the last few lines the error log on srv-m14-24:
one set of messages refers to the o=ipaca backend and seems to be transient,
replication continues later.
the other set of messages, "No original_tombstone ..", is annoying (and it is fixed
in ticket https://fedorahosted.org/389/ticket/47912)

the next thing we can do to try to understand what is going on is to enable
replication logging on m14-26; it will then not only consume all the cpu, but
also write tons of messages to the error log.
But it can be turned on and off:

ldapmodify ...
dn: cn=config
changetype: modify
replace: nsslapd-errorlog-level
nsslapd-errorlog-level: 8192

and let it run for a while, then set it back to: 0
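For example, to switch it back off:

ldapmodify ...
dn: cn=config
changetype: modify
replace: nsslapd-errorlog-level
nsslapd-errorlog-level: 0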
I enabled replication logging and it's running now. I noticed the
default value for nsslapd-errorlog-level was set to 16384 (not 0).

OK to send you the logs off list? Looks like they contain quite a bit of
sensitive data.

Thanks again for all the help looking into this.

Best,

--Andrew



[12/Jul/2015:10:11:14 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No 
original_tombstone for changenumber=2456070,cn=changelog!!
[12/Jul/2015:10:11:48 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No 
original_tombstone for changenumber=2498441,cn=changelog!!
[13/Jul/2015:07:41:49 -0400] NSMMReplicationPlugin - 
agmt="cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat" 
(srv-m14-26:389): Consumer failed to replay change (uniqueid 
cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a3a406000000600000): Operations error (1). 
Will retry later.
[13/Jul/2015:11:56:50 -0400] NSMMReplicationPlugin - 
agmt="cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat" 
(srv-m14-26:389): Consumer failed to replay change (uniqueid 
cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a3dfca000000600000): Operations error (1). 
Will retry later.
[13/Jul/2015:14:26:50 -0400] NSMMReplicationPlugin - 
agmt="cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat" 
(srv-m14-26:389): Consumer failed to replay change (uniqueid 
cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a402f2000000600000): Operations error (1). 
Will retry later.
[13/Jul/2015:15:26:49 -0400] NSMMReplicationPlugin - 
agmt="cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat" 
(srv-m14-26:389): Consumer failed to replay change (uniqueid 
cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a41102000000600000): Operations error (1). 
Will retry later.
[13/Jul/2015:18:26:51 -0400] NSMMReplicationPlugin - 
agmt="cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat" 
(srv-m14-26:389): Consumer failed to replay change (uniqueid 
cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a43b32000000600000): Operations error (1). 
Will retry later.
[13/Jul/2015:18:56:51 -0400] NSMMReplicationPlugin - 
agmt="cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat" 
(srv-m14-26:389): Consumer failed to replay change (uniqueid 
cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a4423a000000600000): Operations error (1). 
Will retry later.
[13/Jul/2015:20:41:51 -0400] NSMMReplicationPlugin - 
agmt="cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat" 
(srv-m14-26:389): Consumer failed to replay change (uniqueid 
cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a45ad6000000600000): Operations error (1). 
Will retry later.
[13/Jul/2015:22:41:51 -0400] NSMMReplicationPlugin - 
agmt="cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat" 
(srv-m14-26:389): Consumer failed to replay change (uniqueid 
cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a476f6000000600000): Operations error (1). 
Will retry later.
[14/Jul/2015:06:56:51 -0400] NSMMReplicationPlugin - 
agmt="cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat" 
(srv-m14-26:389): Consumer failed to replay change (uniqueid 
cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a4eafa000000600000): Operations error (1). 
Will retry later.
[14/Jul/2015:09:56:52 -0400] NSMMReplicationPlugin - 
agmt="cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat" 
(srv-m14-26:389): Consumer failed to replay change (uniqueid 
cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a5152a000000600000): Operations error (1). 
Will retry later.
[14/Jul/2015:10:11:21 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No 
original_tombstone for changenumber=2552223,cn=changelog!!
[14/Jul/2015:10:11:21 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No 
original_tombstone for changenumber=2552224,cn=changelog!!
[14/Jul/2015:10:11:25 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No 
original_tombstone for changenumber=2557315,cn=changelog!!
[14/Jul/2015:10:11:25 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No 
original_tombstone for changenumber=2557318,cn=changelog!!
[14/Jul/2015:10:11:28 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No 
original_tombstone for changenumber=2561020,cn=changelog!!
[14/Jul/2015:10:11:28 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No 
original_tombstone for changenumber=2561043,cn=changelog!!
[14/Jul/2015:10:11:48 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No 
original_tombstone for changenumber=2586022,cn=changelog!!
[14/Jul/2015:10:11:59 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No 
original_tombstone for changenumber=2598989,cn=changelog!!
[14/Jul/2015:10:11:59 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No 
original_tombstone for changenumber=2598990,cn=changelog!!
[14/Jul/2015:10:12:01 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No 
original_tombstone for changenumber=2600966,cn=changelog!!
[14/Jul/2015:10:12:03 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No 
original_tombstone for changenumber=2604037,cn=changelog!!
[14/Jul/2015:10:12:04 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No 
original_tombstone for changenumber=2604054,cn=changelog!!
[14/Jul/2015:10:12:24 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No 
original_tombstone for changenumber=2629803,cn=changelog!!
[14/Jul/2015:10:12:24 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No 
original_tombstone for changenumber=2629804,cn=changelog!!





# replica, o\3Dipaca, mapping tree, config
dn: cn=replica,cn=o\3Dipaca,cn=mapping tree,cn=config
nsds50ruv: {replicageneration} 5527f74b000000600000
nsds50ruv: {replica 96 ldap://srv-m14-24.ccr.buffalo.edu:389} 5527f754000000600000 55a557f6000000600000
nsds50ruv: {replica 86 ldap://srv-m14-25-02.ccr.buffalo.edu:389} 55943e6e000000560000 55943e6f000100560000
nsds50ruv: {replica 91 ldap://srv-m14-26.ccr.buffalo.edu:389} 5537c7ba0000005b0000 5582c7e40004005b0000


The server with the high cpu load is srv-m14-26. Here are the results of the ldapsearch
from this server:
[srv-m14-26 ~]$ ldapsearch -o ldif-wrap=no -x -D "cn=directory manager" -W  -b 
"cn=config" "objectclass=nsds5replica" nsds50ruv

# replica, dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu, mapping tree, config
dn: cn=replica,cn=dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu,cn=mapping tree,cn=config
nsds50ruv: {replicageneration} 5527f711000000040000
nsds50ruv: {replica 5 ldap://srv-m14-26.ccr.buffalo.edu:389} 5537c773000000050000 55a55b47000300050000
nsds50ruv: {replica 4 ldap://srv-m14-24.ccr.buffalo.edu:389} 5527f771000000040000 55a53eb0000a00040000
nsds50ruv: {replica 6 ldap://srv-m14-25-02.ccr.buffalo.edu:389} 55943dda000000060000 55945378000200060000

# replica, o\3Dipaca, mapping tree, config
dn: cn=replica,cn=o\3Dipaca,cn=mapping tree,cn=config
nsds50ruv: {replicageneration} 5527f74b000000600000
nsds50ruv: {replica 91 ldap://srv-m14-26.ccr.buffalo.edu:389} 5537c7ba0000005b0000 5582c7e40004005b0000
nsds50ruv: {replica 96 ldap://srv-m14-24.ccr.buffalo.edu:389} 5527f754000000600000 55a557f6000000600000
nsds50ruv: {replica 86 ldap://srv-m14-25-02.ccr.buffalo.edu:389} 55943e6e000000560000 55943e6f000100560000


srv-m14-25-02 is our 3rd replica, which we recently added back in after it
failed (it was added back in on 7/1).

Let me know if you need anything else. Thanks for the help.

--Andrew

On 07/14/2015 02:35 PM, Andrew E. Bruno wrote:
On Tue, Jul 14, 2015 at 01:41:57PM +0200, Ludwig Krispenz wrote:
On 07/13/2015 06:36 PM, Andrew E. Bruno wrote:
On Mon, Jul 13, 2015 at 05:29:13PM +0200, Ludwig Krispenz wrote:
On 07/13/2015 05:05 PM, Andrew E. Bruno wrote:
On Mon, Jul 13, 2015 at 04:58:46PM +0200, Ludwig Krispenz wrote:
can you get a pstack of the slapd process, along with a top -H, to find the
thread with high cpu usage?
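For example, assuming gdb/pstack and a standard top are installed:

pstack $(pidof ns-slapd) > /tmp/slapd.stack
top -b -H -n 1 -p $(pidof ns-slapd) | head -40

The TID that top -H reports for the busy thread is the LWP number shown in the pstack output, so the two can be matched up.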
Attached is the full stacktrace of the running ns-slapd process. top -H
shows this thread (2879) with high cpu usage:

2879 dirsrv    20   0 3819252 1.962g  11680 R 99.9  3.1   8822:10 ns-slapd
this thread is a replication thread sending updates. What is strange is that
the current csn_str is quite old (July 7th). I can't tell which agreement
this thread is handling, but it looks like it is heavily reading the changelog
and sending updates. Has anything changed recently in the replication setup?
Yes, we had one replica fail on 6/19, which we removed (not this one
showing high CPU load). We had to perform some manual cleanup of the ipa-ca
RUVs. Then we added the replica back in on 7/1. Since then, replication
appears to have been running normally between the 3 replicas. We've been
monitoring utilization since 7/1 and only recently seen this spike (past
24 hours or so).
Is it still in this state, or was it a spike?
Yes same state.

if it is still consuming high cpu, could you:
- get a few pstacks like the one before, with some time in between; I would
like to see if it is progressing through the csns or looping on the same one
Attached are a few stacktraces. The thread pegging the cpu is:

PID  USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+  COMMAND
2879 dirsrv    20   0 3819252 1.978g  11684 R  99.9  3.2  10148:26  ns-slapd

- check the consumer side: is there anything in the error log? does the
access log show replication activity from this server?
Here are some errors showing up on the first master server, rep1 (rep2 is the
server with the pegged cpu):

[13/Jul/2015:20:41:51 -0400] NSMMReplicationPlugin - 
agmt="cn=masterAgreement1-rep2-pki-tomcat" (rep2:389): Consumer failed to 
replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a45ad6000000600000): 
Operations error (1). Will retry later.
[13/Jul/2015:22:41:51 -0400] NSMMReplicationPlugin - 
agmt="cn=masterAgreement1-rep2-pki-tomcat" (rep2:389): Consumer failed to 
replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a476f6000000600000): 
Operations error (1). Will retry later.
[14/Jul/2015:06:56:51 -0400] NSMMReplicationPlugin - 
agmt="cn=masterAgreement1-rep2-tomcat" (rep2:389): Consumer failed to replay 
change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a4eafa000000600000): 
Operations error (1). Will retry later.


Here are some snips from the access log of rep2:


[14/Jul/2015:08:22:31 -0400] conn=777787 op=9794 EXT oid="2.16.840.1.113730.3.5.5" 
name="Netscape Replication End Session"
[14/Jul/2015:08:22:31 -0400] conn=777787 op=9794 RESULT err=0 tag=120 
nentries=0 etime=0
[14/Jul/2015:08:22:31 -0400] conn=777787 op=9795 EXT oid="2.16.840.1.113730.3.5.12" 
name="replication-multimaster-extop"
[14/Jul/2015:08:22:31 -0400] conn=777787 op=9795 RESULT err=0 tag=120 
nentries=0 etime=0
[14/Jul/2015:08:22:33 -0400] conn=777787 op=9796 EXT oid="2.16.840.1.113730.3.5.5" 
name="Netscape Replication End Session"
..
[14/Jul/2015:08:23:38 -0400] conn=782341 op=129 RESULT err=0 tag=103 nentries=0 
etime=0 csn=55a4ff6c000000050000
..
[14/Jul/2015:08:24:02 -0400] conn=781901 op=1745 RESULT err=0 tag=101 
nentries=1 etime=0
[14/Jul/2015:08:24:03 -0400] conn=777787 op=9810 EXT oid="2.16.840.1.113730.3.5.5" 
name="Netscape Replication End Session"
[14/Jul/2015:08:24:03 -0400] conn=777787 op=9810 RESULT err=0 tag=120 
nentries=0 etime=0
[14/Jul/2015:08:24:03 -0400] conn=777787 op=9811 EXT oid="2.16.840.1.113730.3.5.12" 
name="replication-multimaster-extop"
[14/Jul/2015:08:24:03 -0400] conn=777787 op=9811 RESULT err=0 tag=120 
nentries=0 etime=0
[14/Jul/2015:08:24:05 -0400] conn=777787 op=9812 EXT oid="2.16.840.1.113730.3.5.5" 
name="Netscape Replication End Session"
[14/Jul/2015:08:24:05 -0400] conn=777787 op=9812 RESULT err=0 tag=120 
nentries=0 etime=0
[14/Jul/2015:08:24:08 -0400] conn=777787 op=9813 EXT oid="2.16.840.1.113730.3.5.12" 
name="replication-multimaster-extop"
[14/Jul/2015:08:24:08 -0400] conn=777787 op=9813 RESULT err=0 tag=120 
nentries=0 etime=0
[14/Jul/2015:08:24:10 -0400] conn=777787 op=9814 EXT oid="2.16.840.1.113730.3.5.5" 
name="Netscape Replication End Session"
[14/Jul/2015:08:24:10 -0400] conn=777787 op=9814 RESULT err=0 tag=120 
nentries=0 etime=0
[14/Jul/2015:08:24:10 -0400] conn=777787 op=9815 EXT oid="2.16.840.1.113730.3.5.12" 
name="replication-multimaster-extop"
[14/Jul/2015:08:24:10 -0400] conn=777787 op=9815 RESULT err=0 tag=120 
nentries=0 etime=0

and here's some from the error log:

[13/Jul/2015:22:41:49 -0400] DSRetroclPlugin - replog: an error occured while 
adding change number 2294859, dn = changenumber=2294859,cn=changelog: 
Operations error.
[13/Jul/2015:22:41:49 -0400] retrocl-plugin - retrocl_postob: operation failure 
[1]
[13/Jul/2015:23:56:50 -0400] - dn2entry_ext: Failed to get id for 
changenumber=2296384,cn=changelog from entryrdn index (-30993)
[13/Jul/2015:23:56:50 -0400] - Operation error fetching 
changenumber=2296384,cn=changelog (null), error -30993.
[13/Jul/2015:23:56:50 -0400] DSRetroclPlugin - replog: an error occured while 
adding change number 2296384, dn = changenumber=2296384,cn=changelog: 
Operations error.
[13/Jul/2015:23:56:50 -0400] retrocl-plugin - retrocl_postob: operation failure 
[1]
[14/Jul/2015:06:56:50 -0400] - dn2entry_ext: Failed to get id for 
changenumber=2304418,cn=changelog from entryrdn index (-30993)
[14/Jul/2015:06:56:50 -0400] - Operation error fetching 
changenumber=2304418,cn=changelog (null), error -30993.
[14/Jul/2015:06:56:50 -0400] DSRetroclPlugin - replog: an error occured while 
adding change number 2304418, dn = changenumber=2304418,cn=changelog: 
Operations error.
[14/Jul/2015:06:56:50 -0400] retrocl-plugin - retrocl_postob: operation failure 
[1]


- possibly enable replication logging: nsslapd-errorlog-level: 8192
On a side note, we get hit with this bug often:

https://www.redhat.com/archives/freeipa-users/2015-July/msg00018.html

(a rogue sssd_be process hammering a replica).

This causes high ns-slapd utilization on the replica and restarting sssd
on the client host immediately fixes the issue. However, in this
case, we're not seeing this behavior.




On 07/13/2015 04:46 PM, Andrew E. Bruno wrote:
We have 3 freeipa-replicas. Centos 7.1.1503, ipa-server 4.1.0-18, and
389-ds 1.3.3.1-16.

Recently, the ns-slapd process on one of our replicas started showing higher
than normal CPU usage. ns-slapd is pegged at high CPU usage more or less
constantly.

Seems very similar to this thread:
https://www.redhat.com/archives/freeipa-users/2015-February/msg00281.html

There are a few errors showing in /var/log/dirsrv/slapd-[domain]/errors (not
sure if these are related):


[13/Jul/2015:02:56:49 -0400] retrocl-plugin - retrocl_postob: operation failure 
[1]
[13/Jul/2015:04:11:50 -0400] - dn2entry_ext: Failed to get id for 
changenumber=2277387,cn=changelog from entryrdn index (-30993)
[13/Jul/2015:04:11:50 -0400] - Operation error fetching 
changenumber=2277387,cn=changelog (null), error -30993.
[13/Jul/2015:04:11:50 -0400] DSRetroclPlugin - replog: an error occured while 
adding change number 2277387, dn = changenumber=2277387,cn=changelog: 
Operations error.
[13/Jul/2015:04:11:50 -0400] retrocl-plugin - retrocl_postob: operation failure 
[1]
[13/Jul/2015:07:41:49 -0400] - Operation error fetching Null DN 
(01de36ac-295411e5-b94db2ab-07afbca6), error -30993.
[13/Jul/2015:07:41:49 -0400] - dn2entry_ext: Failed to get id for 
changenumber=2281464,cn=changelog from entryrdn index (-30993)
[13/Jul/2015:07:41:49 -0400] - Operation error fetching 
changenumber=2281464,cn=changelog (null), error -30993.
[13/Jul/2015:07:41:49 -0400] DSRetroclPlugin - replog: an error occured while 
adding change number 2281464, dn = changenumber=2281464,cn=changelog: 
Operations error.
[13/Jul/2015:07:41:49 -0400] retrocl-plugin - retrocl_postob: operation failure 
[1]


access logs seem to be showing normal activity. Here's the number of open
connections:

# ls -al /proc/`cat /var/run/dirsrv/slapd-[domain].pid`/fd|grep socket|wc -l
62

Note: the other two replicas have many more open connections (>250) and low
cpu load averages.

Here's some output of logconv.pl from our most recent access log on the replica
with high cpu load:
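(generated with something like: logconv.pl -V /var/log/dirsrv/slapd-[domain]/access - the exact flags may differ; plain logconv.pl <accesslog> prints a similar summary)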

Start of Logs:    13/Jul/2015:04:49:18
End of Logs:      13/Jul/2015:10:06:11

Processed Log Time:  5 Hours, 16 Minutes, 53 Seconds

Restarts:                     0
Total Connections:            2343
  - LDAP Connections:          2120
  - LDAPI Connections:         223
  - LDAPS Connections:         0
  - StartTLS Extended Ops:     45
  Secure Protocol Versions:
   - TLS1.2 128-bit AES - 45

Peak Concurrent Connections:  22
Total Operations:             111865
Total Results:                111034
Overall Performance:          99.3%

Searches:                     95585         (5.03/sec)  (301.64/min)
Modifications:                3369          (0.18/sec)  (10.63/min)
Adds:                         0             (0.00/sec)  (0.00/min)
Deletes:                      0             (0.00/sec)  (0.00/min)
Mod RDNs:                     0             (0.00/sec)  (0.00/min)
Compares:                     0             (0.00/sec)  (0.00/min)
Binds:                        7082          (0.37/sec)  (22.35/min)

Proxied Auth Operations:      0
Persistent Searches:          0
Internal Operations:          0
Entry Operations:             0
Extended Operations:          5317
Abandoned Requests:           416
Smart Referrals Received:     0

VLV Operations:               96
VLV Unindexed Searches:       0
VLV Unindexed Components:     32
SORT Operations:              64

Entire Search Base Queries:   0
Paged Searches:               3882
Unindexed Searches:           0
Unindexed Components:         5

FDs Taken:                    2566
FDs Returned:                 2643
Highest FD Taken:             249

Broken Pipes:                 0
Connections Reset By Peer:    0
Resource Unavailable:         0
Max BER Size Exceeded:        0

Binds:                        7082
Unbinds:                      2443
  - LDAP v2 Binds:             0
  - LDAP v3 Binds:             6859
  - AUTOBINDs:                 223
  - SSL Client Binds:          0
  - Failed SSL Client Binds:   0
  - SASL Binds:                6814
     GSSAPI - 6591
     EXTERNAL - 223
  - Directory Manager Binds:   0
  - Anonymous Binds:           6591
  - Other Binds:               491




strace timing on the ns-slapd process:
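(this is the per-syscall summary from something like: strace -c -p <pid>, interrupted after a few seconds)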


% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  94.40    0.346659        5977        58           poll
   4.10    0.015057       15057         1           restart_syscall
   0.91    0.003353          57        59        59 getpeername
   0.49    0.001796         150        12           futex
   0.10    0.000364          73         5           read
------ ----------- ----------- --------- --------- ----------------
100.00    0.367229                   135        59 total


top output (with threads 'H'):

   PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
  2879 dirsrv    20   0 3819252 1.962g  11680 R 99.9  3.1   8822:10 ns-slapd
  2895 dirsrv    20   0 3819252 1.962g  11680 R 34.1  3.1 115:10.62 ns-slapd
  2889 dirsrv    20   0 3819252 1.962g  11680 R  2.4  3.1 115:34.42 ns-slapd
  2917 dirsrv    20   0 3819252 1.962g  11680 S  2.4  3.1 115:26.87 ns-slapd
  2898 dirsrv    20   0 3819252 1.962g  11680 S  2.1  3.1 116:33.12 ns-slapd
  2904 dirsrv    20   0 3819252 1.962g  11680 S  2.1  3.1 115:08.56 ns-slapd
  2892 dirsrv    20   0 3819252 1.962g  11680 S  1.8  3.1 115:33.04 ns-slapd
  2897 dirsrv    20   0 3819252 1.962g  11680 R  1.8  3.1 114:54.28 ns-slapd
  2914 dirsrv    20   0 3819252 1.962g  11680 R  1.8  3.1 116:03.35 ns-slapd
  2907 dirsrv    20   0 3819252 1.962g  11680 S  1.5  3.1 115:42.25 ns-slapd
  2910 dirsrv    20   0 3819252 1.962g  11680 R  1.5  3.1 116:01.99 ns-slapd
  2870 dirsrv    20   0 3819252 1.962g  11680 R  1.2  3.1 611:30.22 ns-slapd
  2890 dirsrv    20   0 3819252 1.962g  11680 S  1.2  3.1 115:18.25 ns-slapd
  2891 dirsrv    20   0 3819252 1.962g  11680 S  1.2  3.1 115:22.24 ns-slapd
  2899 dirsrv    20   0 3819252 1.962g  11680 R  1.2  3.1 116:11.85 ns-slapd
  2888 dirsrv    20   0 3819252 1.962g  11680 S  0.9  3.1 114:51.19 ns-slapd
  2896 dirsrv    20   0 3819252 1.962g  11680 R  0.9  3.1 115:46.84 ns-slapd
  2915 dirsrv    20   0 3819252 1.962g  11680 S  0.9  3.1 115:49.34 ns-slapd
  2887 dirsrv    20   0 3819252 1.962g  11680 R  0.6  3.1 115:49.85 ns-slapd
  2894 dirsrv    20   0 3819252 1.962g  11680 S  0.6  3.1 115:58.02 ns-slapd
  2911 dirsrv    20   0 3819252 1.962g  11680 S  0.6  3.1 116:22.84 ns-slapd
  2913 dirsrv    20   0 3819252 1.962g  11680 S  0.6  3.1 114:43.56 ns-slapd


ns-slapd stays pegged at >99%. We're trying to figure out what ns-slapd is doing. Any
pointers on where else to look?

Thanks in advance.

--Andrew
