On 11/25/2014 12:32 PM, dbisc...@hrz.uni-kassel.de wrote:
Hi,

with the help of Thierry and Rich I managed to debug the running ns-slapd on Server1 (see below). The failed attempt to decode the SASL data returns a not very fruitful "-1" (SASL_FAIL, "generic failure").

Any ideas? Short summary:

Server1 = running IPA server
Server2 = intended IPA replica

Both machines run the exact same, up-to-date version of CentOS 6.6. However, I had to run "ipa-replica-install" _without_ the "--setup-ca" option (with it, the installation failed with some obscure Perl error), so there's no ns-slapd instance running for PKI-IPA. Might this be related?

Are you asking if not having --setup-ca would cause "sasl_io_recv failed to decode packet for connection 2980"? Not that I know of.

At this point, this is going to take more than a trivial amount of high-latency back-and-forth on the mailing lists. I think we have probably run out of log levels for you to try. Please open a ticket against IPA. While this may turn out to be a bug in 389, at the moment it is only reproducible in your IPA environment.

The fastest way to get to the bottom of this problem would be for a 389 developer to run an interactive gdb session on your production machine and poke around. That is, allow one of us to ssh into the machine and run gdb (which will kill performance and cause outages unless this machine can be taken out of rotation somehow). What would we be looking for? I don't know, but hopefully we would know it when we see it.



On Fri, 21 Nov 2014, Rich Megginson wrote:

On 11/21/2014 04:51 AM, thierry bordaz wrote:
On 11/21/2014 10:59 AM, dbisc...@hrz.uni-kassel.de wrote:
On Thu, 20 Nov 2014, thierry bordaz wrote:
On 11/20/2014 12:03 PM, dbisc...@hrz.uni-kassel.de wrote:
On Thu, 20 Nov 2014, thierry bordaz wrote:

Server1 successfully replicates to Server2, but Server2 fails to replicate to Server1.

The replication Server2->Server1 is done with Kerberos authentication. Server1 receives the replication session, successfully identifies the replication manager, and starts to receive replication extended operations, but then suddenly closes the connection.


[19/Nov/2014:14:21:39 +0100] conn=2980 fd=78 slot=78 connection from xxx to yyy
[19/Nov/2014:14:21:39 +0100] conn=2980 op=0 BIND dn="" method=sasl version=3 mech=GSSAPI
[19/Nov/2014:14:21:39 +0100] conn=2980 op=0 RESULT err=14 tag=97 nentries=0 etime=0, SASL bind in progress
[19/Nov/2014:14:21:39 +0100] conn=2980 op=1 BIND dn="" method=sasl version=3 mech=GSSAPI
[19/Nov/2014:14:21:39 +0100] conn=2980 op=1 RESULT err=14 tag=97 nentries=0 etime=0, SASL bind in progress
[19/Nov/2014:14:21:39 +0100] conn=2980 op=2 BIND dn="" method=sasl version=3 mech=GSSAPI
[19/Nov/2014:14:21:39 +0100] conn=2980 op=2 RESULT err=0 tag=97 nentries=0 etime=0 dn="krbprincipalname=xxx"
[19/Nov/2014:14:21:39 +0100] conn=2980 op=3 SRCH base="" scope=0 filter="(objectClass=*)" attrs="supportedControl supportedExtension"
[19/Nov/2014:14:21:39 +0100] conn=2980 op=3 RESULT err=0 tag=101 nentries=1 etime=0
[19/Nov/2014:14:21:39 +0100] conn=2980 op=4 SRCH base="" scope=0 filter="(objectClass=*)" attrs="supportedControl supportedExtension"
[19/Nov/2014:14:21:39 +0100] conn=2980 op=4 RESULT err=0 tag=101 nentries=1 etime=0
[19/Nov/2014:14:21:39 +0100] conn=2980 op=5 EXT oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
[19/Nov/2014:14:21:39 +0100] conn=2980 op=5 RESULT err=0 tag=120 nentries=0 etime=0
[19/Nov/2014:14:21:39 +0100] conn=2980 op=6 SRCH base="cn=schema" scope=0 filter="(objectClass=*)" attrs="nsSchemaCSN"
[19/Nov/2014:14:21:39 +0100] conn=2980 op=6 RESULT err=0 tag=101 nentries=1 etime=0
[19/Nov/2014:14:21:39 +0100] conn=2980 op=-1 fd=78 closed - I/O function error.

The reason for this closure is logged in Server1's error log: sasl_decode fails to decode a received PDU.

[19/Nov/2014:14:21:39 +0100] - sasl_io_recv failed to decode packet for connection 2980

I do not know why it fails, but I wonder whether the received PDU is larger than the maximum configured value. The attribute nsslapd-maxsasliosize is set to 2MB by default. Would it be possible to increase its value (to, say, 5MB) to see if it has an impact?
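For reference, raising the limit can be done with ldapmodify against cn=config. A minimal sketch, assuming a Directory Manager bind (the 5MB value is illustrative; a restart of the dirsrv instance may be needed for the change to take effect):

  ldapmodify -x -D "cn=directory manager" -W <<EOF
  dn: cn=config
  changetype: modify
  replace: nsslapd-maxsasliosize
  nsslapd-maxsasliosize: 5242880
  EOF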

[...]

I set nsslapd-maxsasliosize to 6164480 on both machines, but the problem remains.

sasl_decode fails, but the exact return value is not logged. With the standard version we may need to attach a debugger and set a conditional breakpoint in sasl_decode, just after the call to conn->oparams.decode, that fires if result != 0. Note that this can change the timing and possibly prevent the problem from occurring again. The other option is to use an instrumented version that logs this value.
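A sketch of what such a gdb session could look like, assuming debuginfo for cyrus-sasl is installed. The file/line location below is hypothetical, and the local variable holding the return code may be named differently in your build; with symbols available, "list sasl_decode" shows the real line just after the conn->oparams.decode call:

  # attach to the running ns-slapd (pick the right pid if several run)
  gdb -p $(pidof ns-slapd)
  # hypothetical spot: the line in cyrus-sasl's sasl_decode right after
  # "result = conn->oparams.decode(...)"; find it with "list sasl_decode"
  (gdb) break common.c:520 if result != 0
  (gdb) continue
  # when the breakpoint fires (should be within seconds):
  (gdb) print result
  (gdb) detach
  (gdb) quit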

If I understand the mechanism correctly, Server1 needs to have debug versions of the relevant packages (probably 389-ds-base and cyrus-sasl) installed in order to track down the problem. Unfortunately, my Server1 is in production use - if I break it, my colleagues will grab forks and torches and be after me. A short downtime would be ok, though.

Is there something else I could do?

Sure, I do not want to cause that much trouble ;-)


I think my email was not clear. To go further we would need to know the exact reason why sasl_decode fails. I see two options:

  * Prepare a debug version that would report in the error log the
    return value of sasl_decode when it fails. Apart from the downtime
    needed to install the debug version, it has no impact on production.

  * Do a debug session (gdb) on Server1: install a breakpoint at a
    specific place, let the server run, catch the sasl_decode failure,
    note the return code, and exit the debugger.
    When the problem occurs, it recurs regularly (every 5 seconds), so
    we should not have to wait long. Debugging Server1 should therefore
    disturb production for only 5 to 10 minutes.
    A detailed procedure for the debug session would be required.

For starters: http://www.port389.org/docs/389ds/FAQ/faq.html#debugging-crashes

Since this is IPA you will need debuginfo packages for ipa and slapi-nis in addition to the ones for 389.
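On CentOS these are usually pulled in with debuginfo-install from yum-utils; something along these lines (package names are from memory and may differ slightly on your system):

  debuginfo-install 389-ds-base ipa-server slapi-nis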

Take a look at the Debugging Hangs section, which describes how to use gdb to get a stack trace. If you can use that gdb command to get a stack trace with full debugging symbols (if you don't know what that means, just post the redacted stack trace somewhere and provide us with a link to it), then you should be all set to do a gdb session to reproduce the error and "catch it in the act".
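For reference, the stack trace command on that page is along these lines (a sketch; the output file name is illustrative):

  gdb -ex 'set confirm off' -ex 'set pagination off' \
      -ex 'thread apply all bt full' -ex 'quit' \
      /usr/sbin/ns-slapd $(pidof ns-slapd) > stacktrace.txt 2>&1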


Mit freundlichen Gruessen/With best regards,

--Daniel.

--
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go To http://freeipa.org for more info on the project
