Rich Megginson wrote:
On 06/19/2015 12:22 PM, Andrew E. Bruno wrote:
Hello,

This is my first time troubleshooting an IPA server failure, and I'm
looking for some guidance on how best to proceed.

First some background on our setup:

Servers are running FreeIPA v4.1.0 on CentOS 7.1.1503:

- ipa-server-4.1.0-18.el7.centos.3.x86_64
- 389-ds-base-1.3.3.1-16.el7_1.x86_64

3 IPA servers: 1 first master (rep1) and 2 replicas (rep2, rep3). The
replicas were set up as CAs (i.e. ipa-replica-install --setup-ca...).

We have ~3000 user accounts (~1000 active, the rest disabled) and
~700 hosts enrolled (all installed using ipa-client-install and
running sssd). Client hosts are a mix of CentOS 7 and CentOS 6.5.


We recently discovered that one of our replicas (rep2) was not
responding. A quick check of the dirsrv error log
/var/log/dirsrv/slapd-XXXX/errors (sanitized) showed:

     PR_Accept() failed, Netscape Portable Runtime error (Process open
     FD table is full.)
     ...
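
(As an aside: a quick way to see how close slapd is to its descriptor
limit, assuming a single ns-slapd instance on the box, is to compare
the process's open fd count against the limit the kernel is actually
enforcing for it:)

     # count open fds for the ns-slapd process
     ls /proc/$(pidof ns-slapd)/fd | wc -l
     # show the fd limit in effect for that process
     grep 'open files' /proc/$(pidof ns-slapd)/limits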

The server was rebooted and, after coming back up, had these errors in
the logs:

         389-Directory/1.3.3.1 B2015.118.1941
         replica2:636 (/etc/dirsrv/slapd-XXXX)

[16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region
error detected; run recovery
[16/Jun/2015:10:12:33 -0400] - Serious Error---Failed to trickle,
err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery)
[16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region
error detected; run recovery
[16/Jun/2015:10:12:33 -0400] - Serious Error---Failed in deadlock
detect (aborted at 0x0), err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal
error, run database recovery)
[16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region
error detected; run recovery
[16/Jun/2015:10:12:33 -0400] - Serious Error---Failed in deadlock
detect (aborted at 0x0), err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal
error, run database recovery)
[16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region
error detected; run recovery
[16/Jun/2015:10:12:33 -0400] - Serious Error---Failed to checkpoint
database, err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run
database recovery)
[16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region
error detected; run recovery
[16/Jun/2015:10:12:33 -0400] - Serious Error---Failed to checkpoint
database, err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run
database recovery)
[16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region
error detected; run recovery
[16/Jun/2015:10:12:33 -0400] - checkpoint_threadmain: log archive
failed - BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery
(-30973)
....
[16/Jun/2015:16:24:04 -0400] - 389-Directory/1.3.3.1 B2015.118.1941
starting up
[16/Jun/2015:16:24:04 -0400] - Detected Disorderly Shutdown last time
Directory Server was running, recovering database.
...
[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin -
replica_check_for_data_reload: Warning: disordely shutdown for replica
dc=XXX. Check if DB RUV needs to be updated
[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of
database RUV (from CL RUV) ->  55770068000300030000
[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of
database RUV (from CL RUV) ->  556f4632001400040000
[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of
database RUV (from CL RUV) ->  556f4631004d00050000
[16/Jun/2015:16:24:15 -0400] slapi_ldap_bind - Error: could not send
startTLS request: error -1 (Can't contact LDAP server) errno 111
(Connection refused)
[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin -
agmt="cn=cloneAgreement1-rep2 (rep1:389): Replication bind with SIMPLE
auth failed: LDAP error -1 (Can't contact LDAP server) ()
[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin -
replica_check_for_data_reload: Warning: disordely shutdown for replica
o=ipaca. Check if DB RUV needs to be updated
[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of
database RUV (from CL RUV) ->  556f46290005005b0000
[16/Jun/2015:16:24:15 -0400] set_krb5_creds - Could not get initial
credentials for principal [ldap/rep2] in keytab
[FILE:/etc/dirsrv/ds.keytab]: -1765328228 (Cannot contact any KDC for
requested realm)
[16/Jun/2015:16:24:15 -0400] slapd_ldap_sasl_interactive_bind - Error:
could not perform interactive bind for id [] mech [GSSAPI]: LDAP error
-1 (Can't contact LDAP server) ((null)) errno 111 (Connection refused)
[16/Jun/2015:16:24:15 -0400] slapi_ldap_bind - Error: could not
perform interactive bind for id [] authentication mechanism [GSSAPI]:
error -1 (Can't contact LDAP server)
[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin -
agmt="cn=meTorep1" (rep1:389): Replication bind with GSSAPI auth
failed: LDAP error -1 (Can't contact LDAP server) ()
[16/Jun/2015:16:24:15 -0400] - Skipping CoS Definition cn=Password
Policy,cn=accounts,dc=xxx--no CoS Templates found, which should be
added before the CoS Definition.
[16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord:
could not delete change record 301438 (rc: 32)
[16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord:
could not delete change record 301439 (rc: 32)
[16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord:
could not delete change record 301440 (rc: 32)
[16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord:
could not delete change record 301441 (rc: 32)
....
[16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord:
could not delete change record 301443 (rc: 32)
[16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord:
could not delete change record 301444 (rc: 32)
[16/Jun/2015:16:24:15 -0400] - slapd started.  Listening on All
Interfaces port 389 for LDAP requests
[16/Jun/2015:16:24:15 -0400] - Listening on All Interfaces port 636
for LDAPS requests
[16/Jun/2015:16:24:15 -0400] - Listening on
/var/run/slapd-XXXX.socket for LDAPI requests
[16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord:
could not delete change record 301445 (rc: 32)
[16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord:
could not delete change record 301446 (rc: 32)
[16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord:
could not delete change record 301447 (rc: 32)
...
[16/Jun/2015:16:24:24 -0400] DSRetroclPlugin - delete_changerecord:
could not delete change record 336362 (rc: 32)
[16/Jun/2015:16:24:24 -0400] slapd_ldap_sasl_interactive_bind - Error:
could not perform interactive bind for id [] mech [GSSAPI]: LDAP error
-1 (Can't contact LDAP server) ((null)) errno 107 (Transport endpoint
is not connected)
[16/Jun/2015:16:24:24 -0400] slapi_ldap_bind - Error: could not
perform interactive bind for id [] authentication mechanism [GSSAPI]:
error -1 (Can't contact LDAP server)
[16/Jun/2015:16:24:24 -0400] slapi_ldap_bind - Error: could not send
startTLS request: error -1 (Can't contact LDAP server) errno 107
(Transport endpoint is not connected)
[16/Jun/2015:16:24:24 -0400] NSMMReplicationPlugin -
agmt="cn=cloneAgreement1-rep2-pki-tomcat" (rep1:389): Replication bind
with SIMPLE auth resumed
[16/Jun/2015:16:24:25 -0400] NSMMReplicationPlugin -
agmt="cn=meTorep1" (rep1:389): Replication bind with GSSAPI auth resumed
[16/Jun/2015:16:27:28 -0400] - Operation error fetching Null DN
(0ce19ce4-146611e5-8135a170-bd40e05c), error -30993.
[16/Jun/2015:16:27:28 -0400] - dn2entry_ext: Failed to get id for
changenumber=336746,cn=changelog from entryrdn index (-30993)
[16/Jun/2015:16:27:28 -0400] - Operation error fetching
changenumber=336746,cn=changelog (null), error -30993.
[16/Jun/2015:16:27:28 -0400] DSRetroclPlugin - replog: an error
occured while adding change number 336746, dn =
changenumber=336746,cn=changelog: Operations error.
[16/Jun/2015:16:27:28 -0400] retrocl-plugin - retrocl_postob:
operation failure [1]
..
[16/Jun/2015:16:27:36 -0400] - Operation error fetching
changenumber=336762,cn=changelog (null), error -30993.
[16/Jun/2015:16:27:36 -0400] DSRetroclPlugin - replog: an error
occured while adding change number 336762, dn =
changenumber=336762,cn=changelog: Operations error.
[16/Jun/2015:16:27:36 -0400] retrocl-plugin - retrocl_postob:
operation failure [1]
[16/Jun/2015:16:27:36 -0400] - ldbm_back_seq deadlock retry BAD 1601,
err=0 BDB0062 Successful return: 0
[16/Jun/2015:16:27:36 -0400] - ldbm_back_seq deadlock retry BAD 1601,
err=0 BDB0062 Successful return: 0
[17/Jun/2015:13:41:23 -0400] - slapd shutting down - signaling
operation threads - op stack size 0 max work q size 3025 max work q
stack size 5
[17/Jun/2015:13:41:23 -0400] - slapd shutting down - waiting for 30
threads to terminate


Assuming we had a corrupted database, we first attempted to remove the
replica by logging into the first master and running:

     # ipa-replica-manage del rep2

This process hung indefinitely, so we shut down all IPA services on
rep2 (systemctl stop ipa) and then re-ran the command on the first
master:

     # ipa-replica-manage -v --force del rep2

This appeared to work, and rep2 appears to have been deleted:

     # ipa-replica-manage list
     rep3: master
     rep1: master

However, when querying LDAP for nsDS5ReplicationAgreement entries, we
still see rep2 with replica ID 97 for o=ipaca:

# ldapsearch -Y GSSAPI -b "cn=mapping tree,cn=config" \
    objectClass=nsDS5ReplicationAgreement -LL

dn: cn=masterAgreement1-rep3-pki-tomcat,cn=replica,cn=ipaca,cn=mapping
tree,cn=config
objectClass: top
objectClass: nsds5replicationagreement
cn: masterAgreement1-rep3-pki-tomcat
nsDS5ReplicaRoot: o=ipaca
nsDS5ReplicaHost: rep3
nsDS5ReplicaPort: 389
nsDS5ReplicaBindDN: cn=Replication Manager
cloneAgreement1-rep3-pki-tomcat,ou=csusers,cn=config
nsDS5ReplicaBindMethod: Simple
nsDS5ReplicaTransportInfo: TLS
description: masterAgreement1-rep3-pki-tomcat
..
nsds50ruv: {replicageneration} 5527f74b000000600000
nsds50ruv: {replica 91 ldap://rep3:389} 5537c7ba0000005b
  0000 5582c7e40004005b0000
nsds50ruv: {replica 96 ldap://rep1:389} 5527f75400000060
  0000 5582cd19000000600000
nsds50ruv: {replica 97 ldap://rep2:389} 5527f76000000061
  0000 556f462b000400610000
nsruvReplicaLastModified: {replica 91 ldap://rep3:389} 0
  0000000
nsruvReplicaLastModified: {replica 96 ldap://rep1:389} 0
  0000000
nsruvReplicaLastModified: {replica 97 ldap://rep2:389} 0
  0000000
nsds5replicaLastUpdateStart: 20150619173149Z
nsds5replicaLastUpdateEnd: 20150619173149Z
nsds5replicaChangesSentSinceStartup:: OTY6MTI0LzAg
nsds5replicaLastUpdateStatus: 0 Replica acquired successfully:
Incremental upd
  ate succeeded
nsds5replicaUpdateInProgress: FALSE
nsds5replicaLastInitStart: 0
nsds5replicaLastInitEnd: 0


Questions:

0. Is it likely that, after running out of file descriptors, the
dirsrv slapd database on rep2 was corrupted?

That would appear to be the case based on the correlation of events,
although I've never seen that happen, and it is not supposed to happen.
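
If the host were still in the topology, one common recovery for a
suspect replica database is to re-initialize it from a healthy master,
e.g. (run on the bad replica; rep1 here is your first master):

     # ipa-replica-manage re-initialize --from rep1

Since you have since deleted rep2, you would instead re-create it the
usual way: ipa-replica-prepare on the first master, then a fresh
ipa-replica-install (with --setup-ca, as before) on rep2.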


1. Do we have to run ipa-replica-manage del rep2 on *each* of the
remaining replica servers (rep1 and rep3)? Or should it just be run on
the first master?

I believe it should only be run on the first master, but it hung, so
something is not right, and I'm not sure how to remedy the situation.

How long did it hang, and where?

Do we need to run ipa-csreplica-manage del as well?

2. Why does the rep2 server still appear when querying the
nsDS5ReplicationAgreement entries in LDAP? Is this benign, or will it
pose problems when we go to add rep2 back in?

You should remove it.

And ipa-csreplica-manage is the tool to do it.
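
Run it from one of the surviving masters, e.g.:

     # ipa-csreplica-manage del rep2

If the stale {replica 97} RUV for o=ipaca still lingers once the
agreement is gone, one way to clear it is a 389-ds CLEANALLRUV task
(the replica-id 97 comes from your output above; the LDIF file name is
arbitrary):

     # cat clean-ruv-97.ldif
     dn: cn=clean 97,cn=cleanallruv,cn=tasks,cn=config
     changetype: add
     objectclass: extensibleObject
     replica-base-dn: o=ipaca
     replica-id: 97
     cn: clean 97

     # ldapmodify -x -D "cn=directory manager" -W -f clean-ruv-97.ldif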


3. What steps/commands can we take to verify rep2 was successfully
removed and replication is behaving normally?

The ldapsearch you already performed will confirm that the CA
agreement has been removed.
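
For example, re-running it restricted to the attributes of interest
should show no agreements pointing at rep2, and a healthy last-update
status on the remaining ones:

     # ldapsearch -Y GSSAPI -b "cn=mapping tree,cn=config" \
         objectClass=nsDS5ReplicationAgreement -LL \
         nsDS5ReplicaHost nsds5replicaLastUpdateStatus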



We had tuned our servers according to the RHEL Performance Tuning
Guide:

    # cat /proc/sys/fs/file-max
    6534908

    # cat /etc/security/limits.d/40-freeipa.conf
    *         soft      nofile      8192
    *         hard      nofile      8192

    # cat /etc/sysctl.d/40-freeipa.conf
    net.ipv4.ip_local_port_range = 1024 65000
    net.ipv4.tcp_keepalive_time = 300

    # cat /etc/sysconfig/dirsrv.systemd
    [Service]
    LimitNOFILE=8192

We're now considering raising the nofile limit to something larger to
prevent running out of file descriptors. Any guidance on what number
to set it to?

8192 is extremely high. The fact that you ran out of file descriptors
at 8192 suggests a bug/fd leak somewhere. I suppose you could, as a
very temporary workaround, set the fd limit higher, but that is no
guarantee that you won't run out again.
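
If you do raise it as a stopgap, the systemd drop-in you already have
is the place to do it; 16384 below is just an arbitrary doubling, not
a recommendation:

    # cat /etc/sysconfig/dirsrv.systemd
    [Service]
    LimitNOFILE=16384

followed by:

     # systemctl daemon-reload
     # systemctl restart dirsrv.target

You may also want nsslapd-maxdescriptors in cn=config to agree with
the new limit.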

Please file at least one ticket, e.g. "database corrupted when server
ran out of file descriptors", with as much information about that
particular problem as you can provide.


Many thanks in advance for any help.

--Andrew


