Re: [Freeipa-users] ipa replica failure

2015-07-01 Thread Andrew E. Bruno
Re: [Freeipa-users] ipa replica failure

2015-06-25 Thread Andrew E. Bruno
On Mon, Jun 22, 2015 at 12:49:01PM -0400, Rob Crittenden wrote:
 
 You aren't seeing a replication agreement. You're seeing the Replication
 Update Vector (RUV).
 
 See http://directory.fedoraproject.org/docs/389ds/howto/howto-cleanruv.html
 
 You need to do something like:
 
 # ldapmodify -D cn=directory manager -W -a
 dn: cn=clean 97, cn=cleanallruv, cn=tasks, cn=config
 objectclass: extensibleObject
 replica-base-dn: o=ipaca
 replica-id: 97
 cn: clean 97
 
 
 Great, thanks for the clarification.
 
 Curious what's the difference between running the ldapmodify above and
 ipa-replica-manage clean-ruv?
 
 
 Nothing, for the IPA data. This is a remnant from a CA replication
 agreement and it was an oversight not to add similar RUV management options
 to the ipa-csreplica-manage tool.
 

I'm still seeing some inconsistencies. Forgive me if I'm misinterpreting any
of this output (still learning the ropes with FreeIPA here).

Just trying to wrap my head around the RUVs. Trying to follow the docs here:
http://directory.fedoraproject.org/docs/389ds/howto/howto-cleanruv.html

And after running the ldapsearch command to check for obsolete masters
I'm not seeing the replica ID for the old replica we deleted (rep2):


$  ldapsearch -xLLL -D cn=directory manager -W -s sub -b cn=config 
objectclass=nsds5replica
Enter LDAP Password: 
dn: cn=replica,cn=dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu,cn=mapping tree,cn=config
cn: replica
nsDS5Flags: 1
objectClass: nsds5replica
objectClass: top
objectClass: extensibleobject
nsDS5ReplicaType: 3
nsDS5ReplicaRoot: dc=ccr,dc=buffalo,dc=edu
nsds5ReplicaLegacyConsumer: off
nsDS5ReplicaId: 4
nsDS5ReplicaBindDN: cn=replication manager,cn=config
nsDS5ReplicaBindDN: krbprincipalname=ldap/rep2@CCR.BUFFA
 LO.EDU,cn=services,cn=accounts,dc=ccr,dc=buffalo,dc=edu
nsDS5ReplicaBindDN: krbprincipalname=ldap/rep3@CCR.BUFFA
 LO.EDU,cn=services,cn=accounts,dc=ccr,dc=buffalo,dc=edu
nsState:: BABIa4xVJAABAA==
nsDS5ReplicaName: a0957886-df9c11e4-a351aa45-2e06257b
nsds5ReplicaChangeCount: 1687559
nsds5replicareapactive: 0

dn: cn=replica,cn=o\3Dipaca,cn=mapping tree,cn=config
objectClass: top
objectClass: nsDS5Replica
objectClass: extensibleobject
nsDS5ReplicaRoot: o=ipaca
nsDS5ReplicaType: 3
nsDS5ReplicaBindDN: cn=Replication Manager masterAgreement1-rep2
 falo.edu-pki-tomcat,ou=csusers,cn=config
nsDS5ReplicaBindDN: cn=Replication Manager masterAgreement1-rep3
 falo.edu-pki-tomcat,ou=csusers,cn=config
cn: replica
nsDS5ReplicaId: 96
nsDS5Flags: 1
nsState:: YAAPa4xVAAkACgABAA==
nsDS5ReplicaName: c458be8e-df9c11e4-a351aa45-2e06257b
nsds5ReplicaChangeCount: 9480
nsds5replicareapactive: 0


I see: 

dn: cn=replica,cn=dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu,cn=mapping tree,cn=config
nsds5replicaid: 4

and 

dn: cn=replica,cn=o\3Dipaca,cn=mapping tree,cn=config
nsDS5ReplicaId: 96


In the above output I only see the old replica showing up under:

nsDS5ReplicaBindDN: krbprincipalname=ldap/rep2@CCR.BUFFA...

According to the docs I need the nsds5replicaid for use in the CLEANALLRUV
task? 

I also checked the RUV tombstone entry as per the docs:

# ldapsearch -xLLL -D cn=directory manager -W -b dc=ccr,dc=buffalo,dc=edu 
'(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))'
Enter LDAP Password: 
dn: cn=replica,cn=dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu,cn=mapping tree,cn=config
cn: replica
nsDS5Flags: 1
objectClass: nsds5replica
objectClass: top
objectClass: extensibleobject
nsDS5ReplicaType: 3
nsDS5ReplicaRoot: dc=ccr,dc=buffalo,dc=edu
nsds5ReplicaLegacyConsumer: off
nsDS5ReplicaId: 4
nsDS5ReplicaBindDN: cn=replication manager,cn=config
nsDS5ReplicaBindDN: krbprincipalname=ldap/rep2@CCR.BUFFA
 LO.EDU,cn=services,cn=accounts,dc=ccr,dc=buffalo,dc=edu
nsDS5ReplicaBindDN: krbprincipalname=ldap/rep3@CCR.BUFFA
 LO.EDU,cn=services,cn=accounts,dc=ccr,dc=buffalo,dc=edu
nsState:: BADycYxVJAABAA==
nsDS5ReplicaName: a0957886-df9c11e4-a351aa45-2e06257b
nsds50ruv: {replicageneration} 5527f7110004
nsds50ruv: {replica 4 ldap://rep1:389} 5527f77100040
 000 558c722800020004
nsds50ruv: {replica 5 ldap://rep3:389} 5537c77300050
 000 5582c7f600060005
nsds5agmtmaxcsn: 
dc=ccr,dc=buffalo,dc=edu;meTorep3;rep3;389;5;558c572b000a0004
nsruvReplicaLastModified: {replica 4 ldap://rep1:389} 55
 8c7204
nsruvReplicaLastModified: {replica 5 ldap://rep3:389} 00
 00
nsds5ReplicaChangeCount: 1689129
nsds5replicareapactive: 0

And I only see nsds50ruv attributes for rep1 and rep3. However, I'm still seeing
rep2 in the nsDS5ReplicaBindDN.

If I'm parsing this output correctly, it appears the RUV for rep2 is already
cleaned? If so, how come the nsDS5ReplicaBindDN still exists? 

Also, why is there a nsds50ruv attribute for rep2 listed when I run this query
(but not the others above):


$ ldapsearch -xLLL -D cn=directory manager -W -b cn=mapping tree,cn=config 
objectClass=nsDS5ReplicationAgreement

dn: 

Re: [Freeipa-users] ipa replica failure

2015-06-22 Thread Rob Crittenden

Andrew E. Bruno wrote:

On Mon, Jun 22, 2015 at 10:02:59AM -0400, Rob Crittenden wrote:

Andrew E. Bruno wrote:

On Fri, Jun 19, 2015 at 03:18:50PM -0400, Rob Crittenden wrote:

Rich Megginson wrote:

On 06/19/2015 12:22 PM, Andrew E. Bruno wrote:


Questions:

0. Is it likely that after running out of file descriptors the dirsrv
slapd database on rep2 was corrupted?


That would appear to be the case based on correlation of events,
although I've never seen that happen, and it is not supposed to happen.



1. Do we have to run ipa-replica-manage del rep2 on *each* of the
remaining replica servers (rep1 and rep3)? Or should it just be run on
the first master?


I believe it should only be run on the first master, but it hung, so
something is not right, and I'm not sure how to remedy the situation.


How long did it hang, and where?


This command was run on rep1 (first master):

[rep1]$ ipa-replica-manage del rep2

This command hung (~10 minutes) until I hit Ctrl-C. After noticing ldap
queries were hanging on rep2 we ran this on rep2:

[rep2]$ systemctl stop ipa
(shutdown all ipa services on rep2)

Then back on rep1 (first master)

[rep1]$ ipa-replica-manage -v --force del rep2

Which appeared to work ok.




Do we need to run ipa-csreplica-manage del as well?

2. Why does the rep2 server still appear when querying the
nsDS5ReplicationAgreement in ldap? Is this benign or will this pose
problems
when we go to add rep2 back in?


You should remove it.


And ipa-csreplica-manage is the tool to do it.


When I run this on rep1 (first master):

[rep1]$ ipa-csreplica-manage list
Directory Manager password:

rep3: master
rep1: master


[rep1]$ ipa-csreplica-manage del rep2
Directory Manager password:

'rep1' has no replication agreement for 'rep2'

But seems to still be there:

[rep1]$ ldapsearch -Y GSSAPI -b cn=mapping tree,cn=config 
objectClass=nsDS5ReplicationAgreement -LL

dn: cn=masterAgreement1-rep3-pki-tomcat,cn=replica,cn=ipaca,cn=mapping 
tree,cn=config
objectClass: top
objectClass: nsds5replicationagreement
cn: masterAgreement1-rep3-pki-tomcat
nsDS5ReplicaRoot: o=ipaca
nsDS5ReplicaHost: rep3
nsDS5ReplicaPort: 389
nsDS5ReplicaBindDN: cn=Replication Manager 
cloneAgreement1-rep3-pki-tomcat,ou=csusers,cn=config
nsDS5ReplicaBindMethod: Simple
nsDS5ReplicaTransportInfo: TLS
description: masterAgreement1-rep3-pki-tomcat
nsds50ruv: {replicageneration} 5527f74b0060
nsds50ruv: {replica 91 ldap://rep3:389} 5537c7ba005b
   5582c7e40004005b
nsds50ruv: {replica 96 ldap://rep1:389} 5527f7540060
   5582cd190060
nsds50ruv: {replica 97 ldap://rep2:389} 5527f7600061
   556f462b00040061
nsruvReplicaLastModified: {replica 91 ldap://rep3:389} 0
  000
nsruvReplicaLastModified: {replica 96 ldap://rep1:389} 0
  000
nsruvReplicaLastModified: {replica 97 ldap://rep2:389} 0
  000
nsds5replicaLastUpdateStart: 20150619193149Z
nsds5replicaLastUpdateEnd: 20150619193149Z
nsds5replicaChangesSentSinceStartup:: OTY6MTMyLzAg
nsds5replicaLastUpdateStatus: 0 Replica acquired successfully: Incremental upd
  ate succeeded
nsds5replicaUpdateInProgress: FALSE
nsds5replicaLastInitStart: 0
nsds5replicaLastInitEnd: 0
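
Side note on reading that output: in LDIF, a value after a double colon (::) is base64-encoded. The nsds5replicaChangesSentSinceStartup value above, for instance, decodes to a plain counter string (in 389-ds the format is replica-id:changes-sent/changes-skipped):

```shell
# LDIF marks base64 values with "::"; decode the
# nsds5replicaChangesSentSinceStartup value from the entry above.
printf 'OTY6MTMyLzAg' | base64 -d
# prints "96:132/0 ", i.e. replica 96 sent 132 changes and skipped 0
```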


However, when I run the ldapsearch on rep3 it's not there (the
cn=ipaca,cn=mapping tree,cn=config is not listed):

[rep3]$ ldapsearch -Y GSSAPI -b cn=mapping tree,cn=config 
objectClass=nsDS5ReplicationAgreement -LL

dn: cn=meTorep1,cn=replica,cn=dc\3Dccr\2Cdc\3Dbuffalo\2C dc\3Dedu,cn=mapping 
tree,cn=config
cn: meTorep1
objectClass: nsds5replicationagreement
objectClass: top
nsDS5ReplicaTransportInfo: LDAP
description: me to rep1
nsDS5ReplicaRoot: dc=ccr,dc=buffalo,dc=edu
nsDS5ReplicaHost: rep1






3. What steps/commands can we take to verify rep2 was successfully
removed and
replication is behaving normally?


The ldapsearch you performed already will confirm that the CA agreement has
been removed.


Still showing up.. Any thoughts?

At this point we want to ensure both remaining masters are functional and
operating normally. Any other commands you recommend running to check?


You aren't seeing a replication agreement. You're seeing the Replication
Update Vector (RUV).

See http://directory.fedoraproject.org/docs/389ds/howto/howto-cleanruv.html

You need to do something like:

# ldapmodify -D cn=directory manager -W -a
dn: cn=clean 97, cn=cleanallruv, cn=tasks, cn=config
objectclass: extensibleObject
replica-base-dn: o=ipaca
replica-id: 97
cn: clean 97
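
The task entry above can be generated for any stale replica ID. A minimal sketch (the helper name is mine, not an IPA or 389-ds tool):

```shell
# Sketch: build a CLEANALLRUV task entry for a given replica ID and
# replication base DN (make_clean_ruv_ldif is a hypothetical helper).
make_clean_ruv_ldif() {
  rid="$1"; base="$2"
  cat <<EOF
dn: cn=clean ${rid}, cn=cleanallruv, cn=tasks, cn=config
objectclass: extensibleObject
replica-base-dn: ${base}
replica-id: ${rid}
cn: clean ${rid}
EOF
}

# Feed the result to ldapmodify as Directory Manager, e.g.:
#   make_clean_ruv_ldif 97 o=ipaca | ldapmodify -D "cn=directory manager" -W -a
make_clean_ruv_ldif 97 o=ipaca
```

The same pattern would work against the main suffix by passing its base DN instead of o=ipaca, though for IPA data ipa-replica-manage clean-ruv already does this.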



Great, thanks for the clarification.

Curious what's the difference between running the ldapmodify above and
ipa-replica-manage clean-ruv?



Nothing, for the IPA data. This is a remnant from a CA replication 
agreement and it was an oversight not to add similar RUV management 
options to the ipa-csreplica-manage tool.


rob

--
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go to http://freeipa.org for more info on the project


Re: [Freeipa-users] ipa replica failure

2015-06-22 Thread Andrew E. Bruno
On Mon, Jun 22, 2015 at 10:02:59AM -0400, Rob Crittenden wrote:
 Andrew E. Bruno wrote:
 On Fri, Jun 19, 2015 at 03:18:50PM -0400, Rob Crittenden wrote:
 Rich Megginson wrote:
 On 06/19/2015 12:22 PM, Andrew E. Bruno wrote:
 
 Questions:
 
 0. Is it likely that after running out of file descriptors the dirsrv
 slapd database on rep2 was corrupted?
 
 That would appear to be the case based on correlation of events,
 although I've never seen that happen, and it is not supposed to happen.
 
 
 1. Do we have to run ipa-replica-manage del rep2 on *each* of the
 remaining replica servers (rep1 and rep3)? Or should it just be run on
 the first master?
 
 I believe it should only be run on the first master, but it hung, so
 something is not right, and I'm not sure how to remedy the situation.
 
 How long did it hang, and where?
 
 This command was run on rep1 (first master):
 
 [rep1]$ ipa-replica-manage del rep2
 
 This command hung (~10 minutes) until I hit Ctrl-C. After noticing ldap
 queries were hanging on rep2 we ran this on rep2:
 
 [rep2]$ systemctl stop ipa
 (shutdown all ipa services on rep2)
 
 Then back on rep1 (first master)
 
 [rep1]$ ipa-replica-manage -v --force del rep2
 
 Which appeared to work ok.
 
 
 Do we need to run ipa-csreplica-manage del as well?
 
 2. Why does the rep2 server still appear when querying the
 nsDS5ReplicationAgreement in ldap? Is this benign or will this pose
 problems
 when we go to add rep2 back in?
 
 You should remove it.
 
 And ipa-csreplica-manage is the tool to do it.
 
 When I run this on rep1 (first master):
 
 [rep1]$ ipa-csreplica-manage list
 Directory Manager password:
 
 rep3: master
 rep1: master
 
 
 [rep1]$ ipa-csreplica-manage del rep2
 Directory Manager password:
 
 'rep1' has no replication agreement for 'rep2'
 
 But seems to still be there:
 
 [rep1]$ ldapsearch -Y GSSAPI -b cn=mapping tree,cn=config 
 objectClass=nsDS5ReplicationAgreement -LL
 
 dn: cn=masterAgreement1-rep3-pki-tomcat,cn=replica,cn=ipaca,cn=mapping 
 tree,cn=config
 objectClass: top
 objectClass: nsds5replicationagreement
 cn: masterAgreement1-rep3-pki-tomcat
 nsDS5ReplicaRoot: o=ipaca
 nsDS5ReplicaHost: rep3
 nsDS5ReplicaPort: 389
 nsDS5ReplicaBindDN: cn=Replication Manager 
 cloneAgreement1-rep3-pki-tomcat,ou=csusers,cn=config
 nsDS5ReplicaBindMethod: Simple
 nsDS5ReplicaTransportInfo: TLS
 description: masterAgreement1-rep3-pki-tomcat
 nsds50ruv: {replicageneration} 5527f74b0060
 nsds50ruv: {replica 91 ldap://rep3:389} 5537c7ba005b
    5582c7e40004005b
 nsds50ruv: {replica 96 ldap://rep1:389} 5527f7540060
    5582cd190060
 nsds50ruv: {replica 97 ldap://rep2:389} 5527f7600061
    556f462b00040061
 nsruvReplicaLastModified: {replica 91 ldap://rep3:389} 0
   000
 nsruvReplicaLastModified: {replica 96 ldap://rep1:389} 0
   000
 nsruvReplicaLastModified: {replica 97 ldap://rep2:389} 0
   000
 nsds5replicaLastUpdateStart: 20150619193149Z
 nsds5replicaLastUpdateEnd: 20150619193149Z
 nsds5replicaChangesSentSinceStartup:: OTY6MTMyLzAg
 nsds5replicaLastUpdateStatus: 0 Replica acquired successfully: Incremental 
 upd
   ate succeeded
 nsds5replicaUpdateInProgress: FALSE
 nsds5replicaLastInitStart: 0
 nsds5replicaLastInitEnd: 0
 
 
 However, when I run the ldapsearch on rep3 it's not there (the
 cn=ipaca,cn=mapping tree,cn=config is not listed):
 
 [rep3]$ ldapsearch -Y GSSAPI -b cn=mapping tree,cn=config 
 objectClass=nsDS5ReplicationAgreement -LL
 
 dn: cn=meTorep1,cn=replica,cn=dc\3Dccr\2Cdc\3Dbuffalo\2C dc\3Dedu,cn=mapping 
 tree,cn=config
 cn: meTorep1
 objectClass: nsds5replicationagreement
 objectClass: top
 nsDS5ReplicaTransportInfo: LDAP
 description: me to rep1
 nsDS5ReplicaRoot: dc=ccr,dc=buffalo,dc=edu
 nsDS5ReplicaHost: rep1
 
 
 
 
 3. What steps/commands can we take to verify rep2 was successfully
 removed and
 replication is behaving normally?
 
 The ldapsearch you performed already will confirm that the CA agreement has
 been removed.
 
 Still showing up.. Any thoughts?
 
 At this point we want to ensure both remaining masters are functional and
 operating normally. Any other commands you recommend running to check?
 
 You aren't seeing a replication agreement. You're seeing the Replication
 Update Vector (RUV).
 
 See http://directory.fedoraproject.org/docs/389ds/howto/howto-cleanruv.html
 
 You need to do something like:
 
 # ldapmodify -D cn=directory manager -W -a
 dn: cn=clean 97, cn=cleanallruv, cn=tasks, cn=config
 objectclass: extensibleObject
 replica-base-dn: o=ipaca
 replica-id: 97
 cn: clean 97
 

Great, thanks for the clarification. 

Curious what's the difference between running the ldapmodify above and
ipa-replica-manage clean-ruv? 



--Andrew

-- 
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go to http://freeipa.org for more info on the project


Re: [Freeipa-users] ipa replica failure

2015-06-22 Thread Rob Crittenden

Andrew E. Bruno wrote:

On Fri, Jun 19, 2015 at 03:18:50PM -0400, Rob Crittenden wrote:

Rich Megginson wrote:

On 06/19/2015 12:22 PM, Andrew E. Bruno wrote:


Questions:

0. Is it likely that after running out of file descriptors the dirsrv
slapd database on rep2 was corrupted?


That would appear to be the case based on correlation of events,
although I've never seen that happen, and it is not supposed to happen.



1. Do we have to run ipa-replica-manage del rep2 on *each* of the
remaining replica servers (rep1 and rep3)? Or should it just be run on
the first master?


I believe it should only be run on the first master, but it hung, so
something is not right, and I'm not sure how to remedy the situation.


How long did it hang, and where?


This command was run on rep1 (first master):

[rep1]$ ipa-replica-manage del rep2

This command hung (~10 minutes) until I hit Ctrl-C. After noticing ldap
queries were hanging on rep2 we ran this on rep2:

[rep2]$ systemctl stop ipa
(shutdown all ipa services on rep2)

Then back on rep1 (first master)

[rep1]$ ipa-replica-manage -v --force del rep2

Which appeared to work ok.




Do we need to run ipa-csreplica-manage del as well?

2. Why does the rep2 server still appear when querying the
nsDS5ReplicationAgreement in ldap? Is this benign or will this pose
problems
when we go to add rep2 back in?


You should remove it.


And ipa-csreplica-manage is the tool to do it.


When I run this on rep1 (first master):

[rep1]$ ipa-csreplica-manage list
Directory Manager password:

rep3: master
rep1: master


[rep1]$ ipa-csreplica-manage del rep2
Directory Manager password:

'rep1' has no replication agreement for 'rep2'

But seems to still be there:

[rep1]$ ldapsearch -Y GSSAPI -b cn=mapping tree,cn=config 
objectClass=nsDS5ReplicationAgreement -LL

dn: cn=masterAgreement1-rep3-pki-tomcat,cn=replica,cn=ipaca,cn=mapping 
tree,cn=config
objectClass: top
objectClass: nsds5replicationagreement
cn: masterAgreement1-rep3-pki-tomcat
nsDS5ReplicaRoot: o=ipaca
nsDS5ReplicaHost: rep3
nsDS5ReplicaPort: 389
nsDS5ReplicaBindDN: cn=Replication Manager 
cloneAgreement1-rep3-pki-tomcat,ou=csusers,cn=config
nsDS5ReplicaBindMethod: Simple
nsDS5ReplicaTransportInfo: TLS
description: masterAgreement1-rep3-pki-tomcat
nsds50ruv: {replicageneration} 5527f74b0060
nsds50ruv: {replica 91 ldap://rep3:389} 5537c7ba005b
   5582c7e40004005b
nsds50ruv: {replica 96 ldap://rep1:389} 5527f7540060
   5582cd190060
nsds50ruv: {replica 97 ldap://rep2:389} 5527f7600061
   556f462b00040061
nsruvReplicaLastModified: {replica 91 ldap://rep3:389} 0
  000
nsruvReplicaLastModified: {replica 96 ldap://rep1:389} 0
  000
nsruvReplicaLastModified: {replica 97 ldap://rep2:389} 0
  000
nsds5replicaLastUpdateStart: 20150619193149Z
nsds5replicaLastUpdateEnd: 20150619193149Z
nsds5replicaChangesSentSinceStartup:: OTY6MTMyLzAg
nsds5replicaLastUpdateStatus: 0 Replica acquired successfully: Incremental upd
  ate succeeded
nsds5replicaUpdateInProgress: FALSE
nsds5replicaLastInitStart: 0
nsds5replicaLastInitEnd: 0


However, when I run the ldapsearch on rep3 it's not there (the
cn=ipaca,cn=mapping tree,cn=config is not listed):

[rep3]$ ldapsearch -Y GSSAPI -b cn=mapping tree,cn=config 
objectClass=nsDS5ReplicationAgreement -LL

dn: cn=meTorep1,cn=replica,cn=dc\3Dccr\2Cdc\3Dbuffalo\2C dc\3Dedu,cn=mapping 
tree,cn=config
cn: meTorep1
objectClass: nsds5replicationagreement
objectClass: top
nsDS5ReplicaTransportInfo: LDAP
description: me to rep1
nsDS5ReplicaRoot: dc=ccr,dc=buffalo,dc=edu
nsDS5ReplicaHost: rep1






3. What steps/commands can we take to verify rep2 was successfully
removed and
replication is behaving normally?


The ldapsearch you performed already will confirm that the CA agreement has
been removed.


Still showing up.. Any thoughts?

At this point we want to ensure both remaining masters are functional and
operating normally. Any other commands you recommend running to check?


You aren't seeing a replication agreement. You're seeing the Replication 
Update Vector (RUV).


See http://directory.fedoraproject.org/docs/389ds/howto/howto-cleanruv.html

You need to do something like:

# ldapmodify -D cn=directory manager -W -a
dn: cn=clean 97, cn=cleanallruv, cn=tasks, cn=config
objectclass: extensibleObject
replica-base-dn: o=ipaca
replica-id: 97
cn: clean 97

rob


--
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go to http://freeipa.org for more info on the project


Re: [Freeipa-users] ipa replica failure

2015-06-19 Thread Andrew E. Bruno
On Fri, Jun 19, 2015 at 03:18:50PM -0400, Rob Crittenden wrote:
 Rich Megginson wrote:
 On 06/19/2015 12:22 PM, Andrew E. Bruno wrote:
 
 Questions:
 
 0. Is it likely that after running out of file descriptors the dirsrv
 slapd database on rep2 was corrupted?
 
 That would appear to be the case based on correlation of events,
 although I've never seen that happen, and it is not supposed to happen.
 
 
 1. Do we have to run ipa-replica-manage del rep2 on *each* of the
 remaining replica servers (rep1 and rep3)? Or should it just be run on
 the first master?
 
 I believe it should only be run on the first master, but it hung, so
 something is not right, and I'm not sure how to remedy the situation.
 
 How long did it hang, and where?

This command was run on rep1 (first master):

[rep1]$ ipa-replica-manage del rep2 

This command hung (~10 minutes) until I hit Ctrl-C. After noticing ldap
queries were hanging on rep2 we ran this on rep2:

[rep2]$ systemctl stop ipa
(shutdown all ipa services on rep2)

Then back on rep1 (first master)

[rep1]$ ipa-replica-manage -v --force del rep2

Which appeared to work ok.

 
 Do we need to run ipa-csreplica-manage del as well?
 
 2. Why does the rep2 server still appear when querying the
 nsDS5ReplicationAgreement in ldap? Is this benign or will this pose
 problems
 when we go to add rep2 back in?
 
 You should remove it.
 
 And ipa-csreplica-manage is the tool to do it.

When I run this on rep1 (first master):

[rep1]$ ipa-csreplica-manage list
Directory Manager password: 

rep3: master
rep1: master


[rep1]$ ipa-csreplica-manage del rep2
Directory Manager password: 

'rep1' has no replication agreement for 'rep2'

But seems to still be there:

[rep1]$ ldapsearch -Y GSSAPI -b cn=mapping tree,cn=config 
objectClass=nsDS5ReplicationAgreement -LL

dn: cn=masterAgreement1-rep3-pki-tomcat,cn=replica,cn=ipaca,cn=mapping 
tree,cn=config
objectClass: top
objectClass: nsds5replicationagreement
cn: masterAgreement1-rep3-pki-tomcat
nsDS5ReplicaRoot: o=ipaca
nsDS5ReplicaHost: rep3
nsDS5ReplicaPort: 389
nsDS5ReplicaBindDN: cn=Replication Manager 
cloneAgreement1-rep3-pki-tomcat,ou=csusers,cn=config
nsDS5ReplicaBindMethod: Simple
nsDS5ReplicaTransportInfo: TLS
description: masterAgreement1-rep3-pki-tomcat
nsds50ruv: {replicageneration} 5527f74b0060
nsds50ruv: {replica 91 ldap://rep3:389} 5537c7ba005b
  5582c7e40004005b
nsds50ruv: {replica 96 ldap://rep1:389} 5527f7540060
  5582cd190060
nsds50ruv: {replica 97 ldap://rep2:389} 5527f7600061
  556f462b00040061
nsruvReplicaLastModified: {replica 91 ldap://rep3:389} 0
 000
nsruvReplicaLastModified: {replica 96 ldap://rep1:389} 0
 000
nsruvReplicaLastModified: {replica 97 ldap://rep2:389} 0
 000
nsds5replicaLastUpdateStart: 20150619193149Z
nsds5replicaLastUpdateEnd: 20150619193149Z
nsds5replicaChangesSentSinceStartup:: OTY6MTMyLzAg
nsds5replicaLastUpdateStatus: 0 Replica acquired successfully: Incremental upd
 ate succeeded
nsds5replicaUpdateInProgress: FALSE
nsds5replicaLastInitStart: 0
nsds5replicaLastInitEnd: 0


However, when I run the ldapsearch on rep3 it's not there (the
cn=ipaca,cn=mapping tree,cn=config is not listed):

[rep3]$ ldapsearch -Y GSSAPI -b cn=mapping tree,cn=config 
objectClass=nsDS5ReplicationAgreement -LL

dn: cn=meTorep1,cn=replica,cn=dc\3Dccr\2Cdc\3Dbuffalo\2C dc\3Dedu,cn=mapping 
tree,cn=config
cn: meTorep1
objectClass: nsds5replicationagreement
objectClass: top
nsDS5ReplicaTransportInfo: LDAP
description: me to rep1
nsDS5ReplicaRoot: dc=ccr,dc=buffalo,dc=edu
nsDS5ReplicaHost: rep1
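
The nsds50ruv values above are the thing to compare across masters: each carries a replica ID, the replica's URL, and a min/max CSN, and the first 8 hex digits of a 389-ds CSN are a Unix timestamp. A sketch for summarizing saved ldapsearch output (the helper name is mine; the sample CSNs are padded back to full length, since this archive truncates them):

```shell
# Sketch: list replica IDs/hosts from nsds50ruv lines and decode the
# timestamp leading each max CSN (first 8 hex digits = seconds since
# the epoch). summarize_ruv is a hypothetical helper reading stdin.
summarize_ruv() {
  sed -n 's/^nsds50ruv: {replica \([0-9]*\) \([^}]*\)} .* \([0-9a-f]\{8\}\)[0-9a-f]*$/\1 \2 \3/p' |
  while read -r rid url ts; do
    printf 'replica %s at %s, last change %s\n' \
      "$rid" "$url" "$(date -u -d "@$((16#$ts))" +%Y-%m-%d)"
  done
}

# Sample lines standing in for real `ldapsearch ... nsds50ruv` output;
# full-length CSNs are assumed (the archive strips trailing zeros).
ruv='nsds50ruv: {replica 91 ldap://rep3:389} 5537c7ba0000005b0000 5582c7e40004005b0000
nsds50ruv: {replica 96 ldap://rep1:389} 5527f754000000600000 5582cd19000000600000
nsds50ruv: {replica 97 ldap://rep2:389} 5527f760000000610000 556f462b000400610000'

printf '%s\n' "$ruv" | summarize_ruv
```

A replica whose max-CSN date stopped advancing weeks before the others (rep2 here) is the stale one whose ID goes into the CLEANALLRUV task.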


 
 
 3. What steps/commands can we take to verify rep2 was successfully
 removed and
 replication is behaving normally?
 
 The ldapsearch you performed already will confirm that the CA agreement has
 been removed.

Still showing up.. Any thoughts? 

At this point we want to ensure both remaining masters are functional and
operating normally. Any other commands you recommend running to check? 

 
 8192 is extremely high.  The fact that you ran out of file descriptors
 at 8192 seems like a bug/fd leak somewhere.  I suppose you could, as a
 very temporary workaround, set the fd limit higher, but that is no
 guarantee that you won't run out again.
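
To keep an eye on fd usage before it hits the limit again, something like this works on Linux (a sketch; the helper name is mine, and for dirsrv you'd pass the ns-slapd pid):

```shell
# Sketch (Linux only): compare a process's open-fd count against its
# soft limit via /proc. For dirsrv: fd_usage "$(pgrep -o ns-slapd)".
fd_usage() {
  pid="$1"
  used=$(ls "/proc/$pid/fd" | wc -l)
  limit=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits")
  printf '%s fds open, soft limit %s\n' "$used" "$limit"
}

# Inspect the current shell so the sketch is self-contained.
fd_usage $$
```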
 
 Please file at least 1 ticket e.g. database corrupted when server ran
 out of file descriptors, with as much information about that particular
 problem as you can provide.
 

Will do.

Thanks very much for all the help!

--Andrew

-- 
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go to http://freeipa.org for more info on the project


Re: [Freeipa-users] ipa replica failure

2015-06-19 Thread Andrew E. Bruno
On Fri, Jun 19, 2015 at 09:08:15PM -0700, Janelle wrote:
 On 6/19/15 11:22 AM, Andrew E. Bruno wrote:
 Hello,
 
 First time troubleshooting an IPA server failure and looking for some
 guidance on how best to proceed.
 
 First some background on our setup:
 
 Servers are running freeipa v4.1.0 on CentOS 7.1.1503:
 
 - ipa-server-4.1.0-18.el7.centos.3.x86_64
 - 389-ds-base-1.3.3.1-16.el7_1.x86_64
 
 3 ipa-servers: 1 first master (rep1) and 2 replicas (rep2, rep3). The
 replicas were set up as CAs (i.e. ipa-replica-install --setup-ca...)
 
 We have ~3000 user accounts (~1000 active, the rest disabled). We have
 ~700 hosts enrolled (all installed using ipa-client-install and running
 sssd). Host clients are a mix of CentOS 7 and CentOS 6.5.
 
 
 We recently discovered one of our replica servers (rep2) was not
 responding. A quick check of the dirsrv logs
 /var/log/dirsrv/slapd-/errors (sanitized):
 
  PR_Accept() failed, Netscape Portable Runtime error (Process open
  FD table is full.)
  ...
 
 The server was rebooted and after coming back up had these errors in the 
 logs:
 
  389-Directory/1.3.3.1 B2015.118.1941
  replica2:636 (/etc/dirsrv/slapd-)
 
 [16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error 
 detected; run recovery
 [16/Jun/2015:10:12:33 -0400] - Serious Error---Failed to trickle, err=-30973 
 (BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery)
 [16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error 
 detected; run recovery
 [16/Jun/2015:10:12:33 -0400] - Serious Error---Failed in deadlock detect 
 (aborted at 0x0), err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run 
 database recovery)
 [16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error 
 detected; run recovery
 [16/Jun/2015:10:12:33 -0400] - Serious Error---Failed in deadlock detect 
 (aborted at 0x0), err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run 
 database recovery)
 [16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error 
 detected; run recovery
 [16/Jun/2015:10:12:33 -0400] - Serious Error---Failed to checkpoint 
 database, err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run database 
 recovery)
 [16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error 
 detected; run recovery
 [16/Jun/2015:10:12:33 -0400] - Serious Error---Failed to checkpoint 
 database, err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run database 
 recovery)
 [16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error 
 detected; run recovery
 [16/Jun/2015:10:12:33 -0400] - checkpoint_threadmain: log archive failed - 
 BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery (-30973)
 
 [16/Jun/2015:16:24:04 -0400] - 389-Directory/1.3.3.1 B2015.118.1941 starting 
 up
 [16/Jun/2015:16:24:04 -0400] - Detected Disorderly Shutdown last time 
 Directory Server was running, recovering database.
 ...
 [16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - 
 replica_check_for_data_reload: Warning: disordely shutdown for replica 
 dc=XXX. Check if DB RUV needs to be updated
 [16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of 
 database RUV (from CL RUV) -  5577006800030003
 [16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of 
 database RUV (from CL RUV) -  556f463200140004
 [16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of 
 database RUV (from CL RUV) -  556f4631004d0005
 [16/Jun/2015:16:24:15 -0400] slapi_ldap_bind - Error: could not send 
 startTLS request: error -1 (Can't contact LDAP server) errno 111 (Connection 
 refused)
 [16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - 
 agmt=cn=cloneAgreement1-rep2 (rep1:389): Replication bind with SIMPLE auth 
 failed: LDAP error -1 (Can't contact LDAP server) ()
 [16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - 
 replica_check_for_data_reload: Warning: disordely shutdown for replica 
 o=ipaca. Check if DB RUV needs to be updated
 [16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of 
 database RUV (from CL RUV) -  556f46290005005b
 [16/Jun/2015:16:24:15 -0400] set_krb5_creds - Could not get initial 
 credentials for principal [ldap/rep2] in keytab 
 [FILE:/etc/dirsrv/ds.keytab]: -1765328228 (Cannot contact any KDC for 
 requested realm)
 [16/Jun/2015:16:24:15 -0400] slapd_ldap_sasl_interactive_bind - Error: could 
 not perform interactive bind for id [] mech [GSSAPI]: LDAP error -1 (Can't 
 contact LDAP server) ((null)) errno 111 (Connection refused)
 [16/Jun/2015:16:24:15 -0400] slapi_ldap_bind - Error: could not perform 
 interactive bind for id [] authentication mechanism [GSSAPI]: error -1 
 (Can't contact LDAP server)
 [16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - agmt=cn=meTorep1 
 (rep1:389): Replication bind with GSSAPI auth failed: LDAP error -1 (Can't 
 contact LDAP server) ()
 [16/Jun/2015:16:24:15 -0400] - Skipping CoS Definition cn=Password 
 
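The BDB0087 DB_RUNRECOVERY panics quoted above mean the Berkeley DB environment is unusable until recovery runs. 389-ds normally does this itself on the next start (the "Detected Disorderly Shutdown ... recovering database" line), but if automatic recovery fails, a manual sketch is possible. This is an illustration only, not from the thread: the instance name and db path below are placeholders for the stock layout and will differ per install.

```shell
# Sketch only: run libdb recovery by hand with the server stopped.
# Replace INSTANCE with your actual dirsrv instance name; the db home
# path shown is the default layout and may differ on your system.
systemctl stop dirsrv.target
db_recover -v -h /var/lib/dirsrv/slapd-INSTANCE/db
systemctl start dirsrv.target
```

If db_recover cannot repair the environment, the usual fallback is to re-initialize the replica from a healthy master rather than trust a partially recovered database.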

Re: [Freeipa-users] ipa replica failure

2015-06-19 Thread Janelle

On 6/19/15 11:22 AM, Andrew E. Bruno wrote:

Hello,

First time troubleshooting an IPA server failure, and looking for some
guidance on how best to proceed.

First some background on our setup:

Servers are running freeipa v4.1.0 on CentOS 7.1.1503:

- ipa-server-4.1.0-18.el7.centos.3.x86_64
- 389-ds-base-1.3.3.1-16.el7_1.x86_64

3 IPA servers: 1 first master (rep1) and 2 replicas (rep2, rep3). The
replicas were set up as CAs (i.e., ipa-replica-install --setup-ca ...).

We have ~3000 user accounts (~1000 active, the rest disabled). We have
~700 hosts enrolled (all installed using ipa-client-install and running
sssd). Client hosts are a mix of CentOS 7 and CentOS 6.5.


We recently discovered one of our replica servers (rep2) was not
responding. A quick check of the dirsrv logs
/var/log/dirsrv/slapd-/errors (sanitized):

 PR_Accept() failed, Netscape Portable Runtime error (Process open
 FD table is full.)
 ...

The server was rebooted and after coming back up had these errors in the logs:

 389-Directory/1.3.3.1 B2015.118.1941
 replica2:636 (/etc/dirsrv/slapd-)

[16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error 
detected; run recovery
[16/Jun/2015:10:12:33 -0400] - Serious Error---Failed to trickle, err=-30973 
(BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery)
[16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error 
detected; run recovery
[16/Jun/2015:10:12:33 -0400] - Serious Error---Failed in deadlock detect 
(aborted at 0x0), err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run database 
recovery)
[16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error 
detected; run recovery
[16/Jun/2015:10:12:33 -0400] - Serious Error---Failed in deadlock detect 
(aborted at 0x0), err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run database 
recovery)
[16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error 
detected; run recovery
[16/Jun/2015:10:12:33 -0400] - Serious Error---Failed to checkpoint database, 
err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery)
[16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error 
detected; run recovery
[16/Jun/2015:10:12:33 -0400] - Serious Error---Failed to checkpoint database, 
err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery)
[16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error 
detected; run recovery
[16/Jun/2015:10:12:33 -0400] - checkpoint_threadmain: log archive failed - 
BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery (-30973)

[16/Jun/2015:16:24:04 -0400] - 389-Directory/1.3.3.1 B2015.118.1941 starting up
[16/Jun/2015:16:24:04 -0400] - Detected Disorderly Shutdown last time Directory 
Server was running, recovering database.
...
[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - 
replica_check_for_data_reload: Warning: disordely shutdown for replica dc=XXX. 
Check if DB RUV needs to be updated
[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of database RUV 
(from CL RUV) -  5577006800030003
[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of database RUV 
(from CL RUV) -  556f463200140004
[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of database RUV 
(from CL RUV) -  556f4631004d0005
[16/Jun/2015:16:24:15 -0400] slapi_ldap_bind - Error: could not send startTLS 
request: error -1 (Can't contact LDAP server) errno 111 (Connection refused)
[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - 
agmt=cn=cloneAgreement1-rep2 (rep1:389): Replication bind with SIMPLE auth 
failed: LDAP error -1 (Can't contact LDAP server) ()
[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - 
replica_check_for_data_reload: Warning: disordely shutdown for replica o=ipaca. 
Check if DB RUV needs to be updated
[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of database RUV 
(from CL RUV) -  556f46290005005b
[16/Jun/2015:16:24:15 -0400] set_krb5_creds - Could not get initial credentials 
for principal [ldap/rep2] in keytab [FILE:/etc/dirsrv/ds.keytab]: -1765328228 
(Cannot contact any KDC for requested realm)
[16/Jun/2015:16:24:15 -0400] slapd_ldap_sasl_interactive_bind - Error: could 
not perform interactive bind for id [] mech [GSSAPI]: LDAP error -1 (Can't 
contact LDAP server) ((null)) errno 111 (Connection refused)
[16/Jun/2015:16:24:15 -0400] slapi_ldap_bind - Error: could not perform 
interactive bind for id [] authentication mechanism [GSSAPI]: error -1 (Can't 
contact LDAP server)
[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - agmt=cn=meTorep1 
(rep1:389): Replication bind with GSSAPI auth failed: LDAP error -1 (Can't contact LDAP 
server) ()
[16/Jun/2015:16:24:15 -0400] - Skipping CoS Definition cn=Password 
Policy,cn=accounts,dc=xxx--no CoS Templates found, which should be added before 
the CoS Definition.
[16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could not 
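The PR_Accept() "FD table is full" error that started this incident means slapd exhausted its file descriptors. As an editorial aside (not from the thread): a minimal sketch of raising the 389-ds descriptor ceiling, assuming the stock cn=config attribute nsslapd-maxdescriptors and an example value of 8192, would look like:

```shell
# Sketch only: bump slapd's descriptor ceiling via cn=config.
# nsslapd-maxdescriptors cannot exceed the process's hard "ulimit -n",
# so raise that first (e.g. LimitNOFILE= in the dirsrv systemd unit,
# or the ulimit setting in /etc/sysconfig/dirsrv).
ldapmodify -x -D "cn=directory manager" -W <<EOF
dn: cn=config
changetype: modify
replace: nsslapd-maxdescriptors
nsslapd-maxdescriptors: 8192
EOF
```

Restarting dirsrv and watching the errors log for further PR_Accept failures would confirm whether the descriptor limit was the culprit.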

Re: [Freeipa-users] ipa replica failure

2015-06-19 Thread Rich Megginson

On 06/19/2015 12:22 PM, Andrew E. Bruno wrote:

[Quoted text elided; it repeats Andrew E. Bruno's message above verbatim.]

Re: [Freeipa-users] ipa replica failure

2015-06-19 Thread Rob Crittenden

Rich Megginson wrote:

On 06/19/2015 12:22 PM, Andrew E. Bruno wrote:

[Quoted text elided; it repeats Andrew E. Bruno's message above verbatim.]