On 11/21/2014 04:51 AM, thierry bordaz wrote:
On 11/21/2014 10:59 AM, dbisc...@hrz.uni-kassel.de wrote:
Hi,

On Thu, 20 Nov 2014, thierry bordaz wrote:

On 11/20/2014 12:03 PM, dbisc...@hrz.uni-kassel.de wrote:

On Thu, 20 Nov 2014, thierry bordaz wrote:

Server1 successfully replicates to Server2, but Server2 fails to replicate to Server1.

The replication Server2->Server1 is done with Kerberos authentication. Server1 receives the replication session, successfully identifies the replication manager, and starts to receive replication extended operations (extops), but then suddenly closes the connection.


  [19/Nov/2014:14:21:39 +0100] conn=2980 fd=78 slot=78 connection from xxx to yyy
  [19/Nov/2014:14:21:39 +0100] conn=2980 op=0 BIND dn="" method=sasl version=3 mech=GSSAPI
  [19/Nov/2014:14:21:39 +0100] conn=2980 op=0 RESULT err=14 tag=97 nentries=0 etime=0, SASL bind in progress
  [19/Nov/2014:14:21:39 +0100] conn=2980 op=1 BIND dn="" method=sasl version=3 mech=GSSAPI
  [19/Nov/2014:14:21:39 +0100] conn=2980 op=1 RESULT err=14 tag=97 nentries=0 etime=0, SASL bind in progress
  [19/Nov/2014:14:21:39 +0100] conn=2980 op=2 BIND dn="" method=sasl version=3 mech=GSSAPI
  [19/Nov/2014:14:21:39 +0100] conn=2980 op=2 RESULT err=0 tag=97 nentries=0 etime=0 dn="krbprincipalname=xxx"
  [19/Nov/2014:14:21:39 +0100] conn=2980 op=3 SRCH base="" scope=0 filter="(objectClass=*)" attrs="supportedControl supportedExtension"
  [19/Nov/2014:14:21:39 +0100] conn=2980 op=3 RESULT err=0 tag=101 nentries=1 etime=0
  [19/Nov/2014:14:21:39 +0100] conn=2980 op=4 SRCH base="" scope=0 filter="(objectClass=*)" attrs="supportedControl supportedExtension"
  [19/Nov/2014:14:21:39 +0100] conn=2980 op=4 RESULT err=0 tag=101 nentries=1 etime=0
  [19/Nov/2014:14:21:39 +0100] conn=2980 op=5 EXT oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
  [19/Nov/2014:14:21:39 +0100] conn=2980 op=5 RESULT err=0 tag=120 nentries=0 etime=0
  [19/Nov/2014:14:21:39 +0100] conn=2980 op=6 SRCH base="cn=schema" scope=0 filter="(objectClass=*)" attrs="nsSchemaCSN"
  [19/Nov/2014:14:21:39 +0100] conn=2980 op=6 RESULT err=0 tag=101 nentries=1 etime=0
  [19/Nov/2014:14:21:39 +0100] conn=2980 op=-1 fd=78 closed - I/O function error.

The reason for this closure is logged in Server1's error log: sasl_decode fails to decode a received PDU.

  [19/Nov/2014:14:21:39 +0100] - sasl_io_recv failed to decode packet for connection 2980

I do not know why it fails, but I wonder whether the received PDU is larger than the maximum configured value. The attribute nsslapd-maxsasliosize is set to 2 MB by default. Would it be possible to increase its value (e.g. to 5 MB) to see if it has an impact?
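
For illustration, such a change could be applied with ldapmodify along these lines (a sketch only: the host, bind DN, and filename are placeholders, and 5242880 is 5 MB in bytes):

  # modify-maxsasliosize.ldif (hypothetical filename)
  dn: cn=config
  changetype: modify
  replace: nsslapd-maxsasliosize
  nsslapd-maxsasliosize: 5242880

  # Apply it (host and bind DN are placeholders; adjust to your environment):
  ldapmodify -x -D "cn=Directory Manager" -W -H ldap://server1.example.com -f modify-maxsasliosize.ldif

A follow-up search on cn=config for nsslapd-maxsasliosize can confirm the new value took effect.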

[...]

I set nsslapd-maxsasliosize to 6164480 on both machines, but the problem remains.

The sasl_decode call fails, but the exact return value is not logged. With the standard version, we may need to attach a debugger and set a conditional breakpoint in sasl_decode, just after conn->oparams.decode, that fires if result != 0. This can change the timing, though, and possibly prevent the problem from occurring again. The other option is to use an instrumented version to log this value.
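
For illustration, such a debugger session might look roughly like this (a sketch only: it assumes the relevant debuginfo is installed so the sasl_decode symbol resolves, and since sasl_decode is also hit for packets that decode successfully, continue/finish may need repeating until a non-zero return value appears):

  # Attach to the running directory server (PID lookup is illustrative).
  gdb -p $(pidof ns-slapd)

  (gdb) break sasl_decode
  (gdb) continue
  # ... the breakpoint fires when the next replication packet arrives ...
  (gdb) finish
  # gdb prints the return value, e.g. "Value returned is $1 = -1"
  (gdb) detach
  (gdb) quit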

If I understand the mechanism correctly, Server1 needs to have debug versions of the relevant packages (probably 389-ds-base and cyrus-sasl) installed in order to track down the problem. Unfortunately, my Server1 is in production use - if I break it, my colleagues will grab forks and torches and be after me. A short downtime would be ok, though.

Is there something else I could do?

Hello,

Sure, I do not want to cause so much trouble ;-)


I think my email was not clear. To go further, we would need to know the exact reason why sasl_decode fails. I see two options:

  * Prepare a debug version that would report in the error log the
    returned value of sasl_decode (when it fails). Apart from the
    downtime needed to install it, this has no impact on production.

  * Do a debug session (gdb) on Server1. The debug session would set a
    breakpoint at a specific place, let the server run, catch the
    sasl_decode failure, note the return code, and then exit the
    debugger. When the problem occurs it recurs regularly (every 5
    seconds), so we should not have to wait long. That means debugging
    Server1 should disturb production for only 5 to 10 minutes. A
    detailed procedure for the debug session is required.

For starters: http://www.port389.org/docs/389ds/FAQ/faq.html#debugging-crashes

Since this is IPA, you will need debuginfo packages for ipa and slapi-nis in addition to the ones for 389.

Take a look at the Debugging Hangs section, which describes how to use gdb to get a stack trace. If you can use that gdb command to get a stack trace with full debugging symbols (if you don't know what that means, just post the redacted stack trace somewhere and provide us with a link to it), then you should be all set to do a gdb session to reproduce the error and "catch it in the act".
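
For reference, the stack-trace command on that page looks roughly like this (the output filename is illustrative):

  # Dump a full backtrace of all threads of the running server to a file
  # (adapted from the port389.org debugging FAQ; requires gdb and debuginfo).
  gdb -ex 'set confirm off' -ex 'set pagination off' \
      -ex 'thread apply all bt full' -ex 'quit' \
      /usr/sbin/ns-slapd $(pidof ns-slapd) > /tmp/stacktrace.txt 2>&1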


thanks
thierry



Mit freundlichen Gruessen/With best regards,

--Daniel.



