On 04 May 2014, at 10:18 PM, Graham Leggett <[email protected]> wrote:

> We got "ber_get_next failed for connection 2" without being told what exactly 
> failed. Will dig on the other side to see if there are any clues.

On the other side we see this:

[04/May/2014:21:29:10 +0100] NSMMReplicationPlugin - agmt_add: begin
[04/May/2014:21:29:10 +0100] NSMMReplicationPlugin - agmt="cn=Agreement 
serverc.example.com" (serverc:636): Trying secure slapi_ldap_init_ext
[04/May/2014:21:29:10 +0100] NSMMReplicationPlugin - agmt="cn=Agreement 
serverc.example.com" (serverc:636): binddn = cn=Replication Manager,c
n=config,  passwd = {DES}xxx
[04/May/2014:21:29:11 +0100] NSMMReplicationPlugin - agmt="cn=Agreement 
serverc.example.com" (serverc:636): No linger to cancel on the connec
tion
[04/May/2014:21:29:12 +0100] - _csngen_adjust_local_time: gen state before 
53670e7b0001:1399235244:0:27599
[04/May/2014:21:29:12 +0100] - _csngen_adjust_local_time: gen state after 
53670ee70000:1399235352:0:27599
[04/May/2014:21:29:12 +0100] NSMMReplicationPlugin - agmt="cn=Agreement 
serverc.example.com" (serverc:636): Replica was successfully acquired
.
[04/May/2014:21:29:12 +0100] NSMMReplicationPlugin - Beginning total update of 
replica "agmt="cn=Agreement serverc.example.com" (serverc:636)
".
[04/May/2014:21:29:12 +0100] - conn 2 leaving turbo mode due to 4
[04/May/2014:21:29:12 +0100] - listener got signaled
[04/May/2014:21:29:19 +0100] NSMMReplicationPlugin - agmt="cn=Agreement 
serverc.example.com" (serverc:636): Failed to send extended operation: LDAP 
error -1 (Can't contact LDAP server)

The above message is printed by the following code:

            slapi_log_error(SLAPI_LOG_FATAL, repl_plugin_name,
                                        "%s: Failed to send %s operation: LDAP 
error %d (%s)\n",
                                        agmt_get_long_name(conn->agmt),
                                        op_string ? op_string : "NULL", rc, 
ldap_err2string(rc));

The above error is bogus - there is no such error as "error -1" in LDAP, it 
turns out that at the top of perform_operation() the rc variable is set as 
follows and appears to drop through to the error message without anything 
happening:

        int rc = -1;

Looking at how rc could drop through this code untouched we see this below, 
there is no default section in the switch, and so it seems optype has a value 
other than those handled:

                switch (optype)
                {
                case CONN_ADD:
                        conn->status = STATUS_PROCESSING_ADD;
                        op_string = "add";
                        rc = ldap_add_ext(conn->ld, dn, attrs, server_controls,
                                NULL /* clientctls */,  &msgid);
                        break;
                case CONN_MODIFY:
                        conn->status = STATUS_PROCESSING_MODIFY;
                        op_string = "modify";
                        rc = ldap_modify_ext(conn->ld, dn, attrs, 
server_controls,
                                NULL /* clientctls */,  &msgid);
                        break;
                case CONN_DELETE:
                        conn->status = STATUS_PROCESSING_DELETE;
                        op_string = "delete";
                        rc = ldap_delete_ext(conn->ld, dn, server_controls,
                                NULL /* clientctls */, &msgid);
                        break;
                case CONN_RENAME:
                        conn->status = STATUS_PROCESSING_RENAME;
                        op_string = "rename";
                        rc = ldap_rename(conn->ld, dn, newrdn, newparent, 
deleteoldrdn,
                                server_controls, NULL /* clientctls */, &msgid);
                        break;
                case CONN_EXTENDED_OPERATION:
                        conn->status = STATUS_PROCESSING_EXTENDED_OPERATION;
                        op_string = "extended";
                        rc = ldap_extended_operation(conn->ld, extop_oid, 
extop_payload,
                                server_controls, NULL /* clientctls */, &msgid);
                }

Putting in a breakpoint at perform_operation() shows that optype each time is 
always CONN_EXTENDED_OPERATION, so it is looking like ldap_extended_operation() 
is returning -1. The code is optimised, so I can't be sure though.

The log continues:

[04/May/2014:21:29:19 +0100] NSMMReplicationPlugin - agmt="cn=Agreement 
serverc.example.com" (serverc:636): Disconnected from the consumer
[04/May/2014:21:29:19 +0100] NSMMReplicationPlugin - agmt="cn=Agreement 
serverc.example.com" (serverc:636): Connection disconnected by another thread
[04/May/2014:21:29:19 +0100] NSMMReplicationPlugin - agmt="cn=Agreement 
serverc.example.com" (serverc:636): Received error -1 (Can't contact LDAP 
server):  for total update operation
[04/May/2014:21:29:19 +0100] - repl5_tot_waitfor_async_results: 258 -1
[04/May/2014:21:29:20 +0100] NSMMReplicationPlugin - agmt="cn=Agreement 
serverc.example.com" (serverc:636): Warning: unable to send endReplication 
extended operation (Can't contact LDAP server)
[04/May/2014:21:29:20 +0100] NSMMReplicationPlugin - agmt="cn=Agreement 
serverc.example.com" (serverc:636): repl5_tot_run: failed to obtain data to 
send to the consumer; LDAP error - -2
[04/May/2014:21:29:20 +0100] NSMMReplicationPlugin - agmt="cn=Agreement 
serverc.example.com" (serverc:636): No linger to cancel on the connection
[04/May/2014:21:29:20 +0100] NSMMReplicationPlugin - agmt="cn=Agreement 
serverc.example.com" (serverc:636): Disconnected from the consumer

It appears having failed above we pretend we didn't and try and connect again. 
This time we succeed, but because we've never initialised successfully, we 
throw our error "Replica has a different generation ID than the local data.":

[04/May/2014:21:29:20 +0100] NSMMReplicationPlugin - agmt="cn=Agreement 
serverc.example.com" (serverc:636): State: start -> ready_to_acquire_replica
[04/May/2014:21:29:20 +0100] NSMMReplicationPlugin - agmt="cn=Agreement 
serverc.example.com" (serverc:636): Trying secure slapi_ldap_init_ext
[04/May/2014:21:29:20 +0100] NSMMReplicationPlugin - agmt="cn=Agreement 
serverc.example.com" (serverc:636): binddn = cn=Replication Manager,cn=config,  
passwd = {DES}xxx
[04/May/2014:21:29:21 +0100] NSMMReplicationPlugin - agmt="cn=Agreement 
serverc.example.com" (serverc:636): Replication bind with SIMPLE auth resumed
[04/May/2014:21:29:21 +0100] NSMMReplicationPlugin - agmt="cn=Agreement 
serverc.example.com" (serverc:636): No linger to cancel on the connection
[04/May/2014:21:29:21 +0100] - _csngen_adjust_local_time: gen state before 
53670ee70001:1399235352:0:27599
[04/May/2014:21:29:21 +0100] - _csngen_adjust_local_time: gen state after 
53670ef00000:1399235361:0:27599
[04/May/2014:21:29:21 +0100] NSMMReplicationPlugin - agmt="cn=Agreement 
serverc.example.com" (serverc:636): Replica was successfully acquired.
[04/May/2014:21:29:21 +0100] NSMMReplicationPlugin - agmt="cn=Agreement 
serverc.example.com" (serverc:636): State: ready_to_acquire_replica -> 
sending_updates
[04/May/2014:21:29:21 +0100] NSMMReplicationPlugin - agmt="cn=Agreement 
serverc.example.com" (serverc:636): Replica has a different generation ID than 
the local data.
[04/May/2014:21:29:22 +0100] NSMMReplicationPlugin - agmt="cn=Agreement 
serverc.example.com" (serverc:636): Successfully released consumer
[04/May/2014:21:29:22 +0100] NSMMReplicationPlugin - agmt="cn=Agreement 
serverc.example.com" (serverc:636): Beginning linger on the connection
[04/May/2014:21:29:22 +0100] NSMMReplicationPlugin - agmt="cn=Agreement 
serverc.example.com" (serverc:636): State: sending_updates -> start_backoff

Regards,
Graham
--

--
389 users mailing list
[email protected]
https://admin.fedoraproject.org/mailman/listinfo/389-users

Reply via email to