On 11/25/2014 12:32 PM, dbisc...@hrz.uni-kassel.de wrote:
Hi,

with the help of Thierry and Rich I managed to debug the running ns-slapd on Server1 (see below). The failed attempt to decode the SASL data returns a not very fruitful "-1" (SASL_FAIL, "generic failure").

Any ideas? Short summary:

Server1 = running IPA server
Server2 = intended IPA replica

Both machines run the exact same, up-to-date version of CentOS 6.6. However, I had to run "ipa-replica-install" _without_ the "--setup-ca" option (with it, the installation failed with some obscure Perl error), so there's no ns-slapd instance running for PKI-IPA. Might this be related?

Are you asking if not having --setup-ca would cause "sasl_io_recv failed to decode packet for connection 2980"? Not that I know of.

At this point, this is going to take more than a trivial amount of high-latency back-and-forth on the mailing lists. I think we have probably run out of log levels for you to try. Please open a ticket against IPA. While this may turn out to be a bug in 389, at the moment it is only reproducible in your IPA environment.

The fastest way to get to the bottom of this problem would be for a 389 developer to run an interactive gdb session on your production machine and poke around. That is, allow one of us to ssh into the machine and run gdb (which will kill performance and cause outages unless this machine can be taken out of rotation somehow). What would we be looking for? I don't know, but hopefully we would know it when we see it.



On Fri, 21 Nov 2014, Rich Megginson wrote:

On 11/21/2014 04:51 AM, thierry bordaz wrote:
On 11/21/2014 10:59 AM, dbisc...@hrz.uni-kassel.de wrote:
On Thu, 20 Nov 2014, thierry bordaz wrote:
On 11/20/2014 12:03 PM, dbisc...@hrz.uni-kassel.de wrote:
On Thu, 20 Nov 2014, thierry bordaz wrote:

Server1 successfully replicates to Server2, but Server2 fails to replicate to Server1.

The replication Server2->Server1 is done with Kerberos authentication. Server1 receives the replication session, successfully identifies the replication manager, and starts to receive replication extended operations, but then suddenly closes the connection.


[19/Nov/2014:14:21:39 +0100] conn=2980 fd=78 slot=78 connection from xxx to yyy
[19/Nov/2014:14:21:39 +0100] conn=2980 op=0 BIND dn="" method=sasl version=3 mech=GSSAPI
[19/Nov/2014:14:21:39 +0100] conn=2980 op=0 RESULT err=14 tag=97 nentries=0 etime=0, SASL bind in progress
[19/Nov/2014:14:21:39 +0100] conn=2980 op=1 BIND dn="" method=sasl version=3 mech=GSSAPI
[19/Nov/2014:14:21:39 +0100] conn=2980 op=1 RESULT err=14 tag=97 nentries=0 etime=0, SASL bind in progress
[19/Nov/2014:14:21:39 +0100] conn=2980 op=2 BIND dn="" method=sasl version=3 mech=GSSAPI
[19/Nov/2014:14:21:39 +0100] conn=2980 op=2 RESULT err=0 tag=97 nentries=0 etime=0 dn="krbprincipalname=xxx"
[19/Nov/2014:14:21:39 +0100] conn=2980 op=3 SRCH base="" scope=0 filter="(objectClass=*)" attrs="supportedControl supportedExtension"
[19/Nov/2014:14:21:39 +0100] conn=2980 op=3 RESULT err=0 tag=101 nentries=1 etime=0
[19/Nov/2014:14:21:39 +0100] conn=2980 op=4 SRCH base="" scope=0 filter="(objectClass=*)" attrs="supportedControl supportedExtension"
[19/Nov/2014:14:21:39 +0100] conn=2980 op=4 RESULT err=0 tag=101 nentries=1 etime=0
[19/Nov/2014:14:21:39 +0100] conn=2980 op=5 EXT oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
[19/Nov/2014:14:21:39 +0100] conn=2980 op=5 RESULT err=0 tag=120 nentries=0 etime=0
[19/Nov/2014:14:21:39 +0100] conn=2980 op=6 SRCH base="cn=schema" scope=0 filter="(objectClass=*)" attrs="nsSchemaCSN"
[19/Nov/2014:14:21:39 +0100] conn=2980 op=6 RESULT err=0 tag=101 nentries=1 etime=0
[19/Nov/2014:14:21:39 +0100] conn=2980 op=-1 fd=78 closed - I/O function error.

The reason for this closure is logged in Server1's error log: sasl_decode fails to decode a received PDU.

[19/Nov/2014:14:21:39 +0100] - sasl_io_recv failed to decode packet for connection 2980

I do not know why it fails, but I wonder whether the received PDU is larger than the maximum configured value. The attribute nsslapd-maxsasliosize is set to 2MB by default. Would it be possible to increase its value (to, say, 5MB) to see if it has an impact?
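For reference, raising the limit can be done with ldapmodify against cn=config. A minimal sketch, assuming a Directory Manager bind (the 5MB value is illustrative; a restart of the dirsrv instance may be needed for the change to take effect):

  ldapmodify -x -D "cn=directory manager" -W <<EOF
  dn: cn=config
  changetype: modify
  replace: nsslapd-maxsasliosize
  nsslapd-maxsasliosize: 5242880
  EOF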

[...]

I set nsslapd-maxsasliosize to 6164480 on both machines, but the problem remains.

sasl_decode fails, but the exact return value is not logged. With the standard version we may need to attach a debugger and set a conditional breakpoint in sasl_decode, just after the call to conn->oparams.decode, that fires if result != 0. Note that this can change the timing and possibly prevent the problem from occurring again. The other option is to use an instrumented version that logs this value.
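A sketch of what such a gdb session could look like, assuming debuginfo for cyrus-sasl is installed. The file/line location below is hypothetical, and the local variable holding the return code may be named differently in your build; with symbols available, "list sasl_decode" shows the real line just after the conn->oparams.decode call:

  # attach to the running ns-slapd (pick the right pid if several run)
  gdb -p $(pidof ns-slapd)
  # hypothetical spot: the line in cyrus-sasl's sasl_decode right after
  # "result = conn->oparams.decode(...)"; find it with "list sasl_decode"
  (gdb) break common.c:520 if result != 0
  (gdb) continue
  # when the breakpoint fires (should be within seconds):
  (gdb) print result
  (gdb) detach
  (gdb) quit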

If I understand the mechanism correctly, Server1 needs to have debug versions of the relevant packages (probably 389-ds-base and cyrus-sasl) installed in order to track down the problem. Unfortunately, my Server1 is in production use - if I break it, my colleagues will grab forks and torches and be after me. A short downtime would be ok, though.

Is there something else I could do?

Sure, I do not want to cause that much trouble ;-)


I think my email was not clear. To go further we would need to know the exact reason why sasl_decode fails. I see two options:

  * Prepare a debug version that would report in the error log the
    return value of sasl_decode when it fails. Apart from the downtime
    needed to install the debug version, it has no impact on production.

  * Do a debug session (gdb) on Server1: install a breakpoint at a
    specific place, let the server run, catch the sasl_decode failure,
    note the return code, and exit the debugger.
    When the problem occurs, it recurs regularly (every 5 seconds), so
    we should not have to wait long. Debugging Server1 should therefore
    disturb production for only 5 to 10 minutes.
    A detailed procedure for the debug session would be required.

For starters: http://www.port389.org/docs/389ds/FAQ/faq.html#debugging-crashes

Since this is IPA you will need debuginfo packages for ipa and slapi-nis in addition to the ones for 389.
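On CentOS these are usually pulled in with debuginfo-install from yum-utils; something along these lines (package names are from memory and may differ slightly on your system):

  debuginfo-install 389-ds-base ipa-server slapi-nis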

Take a look at the Debugging Hangs section, which describes how to use gdb to get a stack trace. If you can use that gdb command to get a stack trace with full debugging symbols (if you don't know what that means, just post the redacted stack trace somewhere and provide us with a link to it), then you should be all set to do a gdb session to reproduce the error and "catch it in the act".
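For reference, the stack trace command on that page is along these lines (a sketch; the output file name is illustrative):

  gdb -ex 'set confirm off' -ex 'set pagination off' \
      -ex 'thread apply all bt full' -ex 'quit' \
      /usr/sbin/ns-slapd $(pidof ns-slapd) > stacktrace.txt 2>&1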


Mit freundlichen Gruessen/With best regards,

--Daniel.

--
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go To http://freeipa.org for more info on the project
