Full_Name: Bill MacAllister Version: 2.4.23-1 OS: Debian Lenny URL: http://www.stanford.edu/~whm/files/ldap-replication-error.txt Submission from: (NULL) (99.55.172.208)
I just replaced a 2.3.43 OpenLDAP master with a 2.4.23-1 master. There are 6 slaves all running 2.4.23-1. After reloading the master database, I confirmed that replication was working my making some changes and then examining contextCSNs. Sometime later replication stopped on one of the servers. Looking at the log on the failing slave I saw: Sep 2 22:12:53 directory3 slapd[7005]: do_syncrep2: rid=000 LDAP_RES_INTERMEDIATE - SYNC_ID_SET Sep 2 22:12:53 directory3 slapd[7005]: do_syncrep2: rid=000 (-1) Can't contact LDAP server Sep 2 22:12:53 directory3 slapd[7005]: do_syncrepl: rid=000 rc -1 retrying On the master I saw: Sep 2 22:12:53 directory-master0 slapd[15145]: conn=3764 fd=38 ACCEPT from IP=171.67.218.156:44519 (IP=0.0.0.0:389) Sep 2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=0 BIND dn="" method=163 Sep 2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=0 RESULT tag=97 err=14 text=SASL(0): successful result: Sep 2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=1 BIND dn="" method=163 Sep 2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=1 RESULT tag=97 err=14 text=SASL(0): successful result: Sep 2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=2 BIND dn="" method=163 Sep 2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=2 BIND authcid="ldap/[email protected]" authzid="ldap/[email protected]" Sep 2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=2 BIND dn="cn=directory3,cn=ldap,cn=operational,dc=stanford,dc=edu" mech=GSSAPI sasl_ssf=56 ssf=56 Sep 2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=2 RESULT tag=97 err=0 text= Sep 2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=3 SRCH base="cn=accesslog" scope=2 deref=0 filter="(&(objectClass=auditWriteObject)(reqResult=0))" Sep 2 22:12:53 directory-master0 slapd[15145]: conn=3764 op=3 SRCH attr=reqDN reqType reqMod reqNewRDN reqDeleteOldRDN reqNewSuperior entryCSN Sep 2 22:12:53 directory-master0 slapd[15145]: conn=3764 fd=38 closed (connection lost on write) Running slapd on the failing slave with -d -1 I saw: =>do_syncrep2 rid=000 ldap_result ld 0x1c07000 msgid 4 wait4msg ld 0x1c07000 msgid 4 (timeout 0 usec) wait4msg continue ld 0x1c07000 msgid 4 all 0 ** ld 0x1c07000 Connections: * host: directory-master.stanford.edu port: 389 (default) refcnt: 2 status: Connected last used: Thu Sep 2 22:12:53 2010 ** ld 0x1c07000 Outstanding Requests: * msgid 4, origid 4, status InProgress outstanding referrals 0, parent count 0 ld 0x1c07000 request count 1 (abandoned 0) ** ld 0x1c07000 Response Queue: Empty ld 0x1c07000 response count 0 ldap_chkResponseList ld 0x1c07000 msgid 4 all 0 ldap_chkResponseList returns ld 0x1c07000 NULL ldap_int_select read1msg: ld 0x1c07000 msgid 4 all 0 ber_get_next sasl_generic_read: want=4, got=4 0000: 00 00 00 78 ...x sasl_generic_read: want=120, got=120 ... lots of output ... sasl_generic_read: want=45833, got=0 ldap_read: want=8, got=0 ber_get_next failed. ldap_err2string do_syncrep2: rid=000 (-1) Can't contact LDAP server I am including a pointer to the complete debug log. Looks like something in the sasl support is unhappy. We are using libsasl2-modules-gssapi-heimdal 2.1.23.dfsg1-5~sbp50+2. Strange to me that this one server is having a problem. There are actually 5 slaves that are working just fine. Here are current contextCSNs for the systems. directory-master contextCSN: 20100903060040.713677Z#000000#000#000000 directory1 contextCSN: 20100903060040.713677Z#000000#000#000000 directory2 contextCSN: 20100903060040.713677Z#000000#000#000000 directory3 contextCSN: 20100903015531.064407Z#000000#000#000000 directory4 contextCSN: 20100903060040.713677Z#000000#000#000000 directory11 contextCSN: 20100903060040.713677Z#000000#000#000000 directory12 contextCSN: 20100903060040.713677Z#000000#000#000000 What else should I look at? Bill
