On 11/25/2014 12:32 PM, [email protected] wrote:
Hi,
with the help of Thierry and Rich I managed to debug the running
ns-slapd on Server1 (see below). The failing attempt to decode the
SASL data returns a not very helpful "-1" (SASL_FAIL, "generic
failure").
Any ideas? Short summary:
Server1 = running IPA server
Server2 = intended IPA replica
Both machines run the exact same, up-to-date version of CentOS 6.6.
However: I had to run "ipa-replica-install" _without_ the option
"--setup-ca" (it didn't work; the installation failed with an obscure
Perl error), so there's no ns-slapd instance running for PKI-IPA.
Might this be related?
Are you asking if not having --setup-ca would cause "sasl_io_recv failed
to decode packet for connection 2980"? Not that I know of.
At this point, it's going to take more than a trivial amount of high
latency back-and-forth on the mailing lists. I think we have probably
run out of log levels for you to try. Please open a ticket against
IPA. While this may turn out to be a bug in 389, at the moment it is
only reproducible in your IPA environment.
The fastest way to get to the bottom of this problem would be for a 389
developer to run an interactive gdb session on your production machine
and poke around. That is, allow one of us to ssh into the machine and
run gdb (which will kill the performance and cause outages unless this
machine can be taken out of rotation somehow). What would we be looking
for? I don't know, but hopefully we would know it when we see it.
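For context, "attach and poke around" means something like the following
(a sketch; the binary path and PID lookup assume a stock CentOS 6
install, and the file name is hypothetical):

```shell
# Sketch: prepare an interactive gdb attach to ns-slapd. While gdb holds
# the process stopped, the server answers nothing -- hence the warning
# about outages above.
cat > /tmp/ns-slapd-attach.gdb <<'EOF'
set confirm off
set pagination off
EOF

# On the affected server (needs debuginfo packages for useful symbols):
#   gdb -x /tmp/ns-slapd-attach.gdb /usr/sbin/ns-slapd $(pidof ns-slapd)
#   (gdb) continue    # let the server keep serving until the next stop
echo "gdb command file written to /tmp/ns-slapd-attach.gdb"
```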
On Fri, 21 Nov 2014, Rich Megginson wrote:
On 11/21/2014 04:51 AM, thierry bordaz wrote:
On 11/21/2014 10:59 AM, [email protected] wrote:
On Thu, 20 Nov 2014, thierry bordaz wrote:
On 11/20/2014 12:03 PM, [email protected] wrote:
On Thu, 20 Nov 2014, thierry bordaz wrote:
Server1 successfully replicates to Server2, but Server2 fails to
replicate to Server1.
The replication Server2->Server1 is done with Kerberos
authentication. Server1 receives the replication session,
successfully identifies the replication manager, and starts to receive
replication extended operations, but then suddenly closes the
connection.
[19/Nov/2014:14:21:39 +0100] conn=2980 fd=78 slot=78 connection from xxx to yyy
[19/Nov/2014:14:21:39 +0100] conn=2980 op=0 BIND dn="" method=sasl version=3 mech=GSSAPI
[19/Nov/2014:14:21:39 +0100] conn=2980 op=0 RESULT err=14 tag=97 nentries=0 etime=0, SASL bind in progress
[19/Nov/2014:14:21:39 +0100] conn=2980 op=1 BIND dn="" method=sasl version=3 mech=GSSAPI
[19/Nov/2014:14:21:39 +0100] conn=2980 op=1 RESULT err=14 tag=97 nentries=0 etime=0, SASL bind in progress
[19/Nov/2014:14:21:39 +0100] conn=2980 op=2 BIND dn="" method=sasl version=3 mech=GSSAPI
[19/Nov/2014:14:21:39 +0100] conn=2980 op=2 RESULT err=0 tag=97 nentries=0 etime=0 dn="krbprincipalname=xxx"
[19/Nov/2014:14:21:39 +0100] conn=2980 op=3 SRCH base="" scope=0 filter="(objectClass=*)" attrs="supportedControl supportedExtension"
[19/Nov/2014:14:21:39 +0100] conn=2980 op=3 RESULT err=0 tag=101 nentries=1 etime=0
[19/Nov/2014:14:21:39 +0100] conn=2980 op=4 SRCH base="" scope=0 filter="(objectClass=*)" attrs="supportedControl supportedExtension"
[19/Nov/2014:14:21:39 +0100] conn=2980 op=4 RESULT err=0 tag=101 nentries=1 etime=0
[19/Nov/2014:14:21:39 +0100] conn=2980 op=5 EXT oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
[19/Nov/2014:14:21:39 +0100] conn=2980 op=5 RESULT err=0 tag=120 nentries=0 etime=0
[19/Nov/2014:14:21:39 +0100] conn=2980 op=6 SRCH base="cn=schema" scope=0 filter="(objectClass=*)" attrs="nsSchemaCSN"
[19/Nov/2014:14:21:39 +0100] conn=2980 op=6 RESULT err=0 tag=101 nentries=1 etime=0
[19/Nov/2014:14:21:39 +0100] conn=2980 op=-1 fd=78 closed - I/O function error.
The reason for this closure is logged in Server1's error log:
sasl_decode fails to decode a received PDU.
[19/Nov/2014:14:21:39 +0100] - sasl_io_recv failed to decode packet for connection 2980
I do not know why it fails, but I wonder whether the received PDU is
larger than the maximum configured value. The attribute
nsslapd-maxsasliosize is set to 2 MB by default. Would it be
possible to increase its value (e.g. to 5 MB) to see if it has an impact
[...]
I set nsslapd-maxsasliosize to 6164480 on both machines, but the
problem remains.
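For anyone following along, a change like this is applied with
ldapmodify against cn=config (a sketch; the bind credentials are
placeholders, and nsslapd-maxsasliosize is in bytes):

```shell
# Sketch: raise nsslapd-maxsasliosize on the directory server
# (placeholder credentials; value is in bytes, 6164480 ~= 5.9 MB).
cat > /tmp/maxsasliosize.ldif <<'EOF'
dn: cn=config
changetype: modify
replace: nsslapd-maxsasliosize
nsslapd-maxsasliosize: 6164480
EOF

# Apply against the live server (requires Directory Manager or
# equivalent credentials), on both machines:
#   ldapmodify -x -D "cn=Directory Manager" -W -f /tmp/maxsasliosize.ldif
```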
sasl_decode fails, but the exact return value is not logged.
With the standard version we may need to attach a debugger and
set a conditional breakpoint in sasl_decode, just after
conn->oparams.decode, that fires if result != 0. This can
change the timing and possibly prevent the problem from occurring
again. The other option is to use an instrumented version that logs
this value.
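A debugger session along those lines might look like this (a sketch;
it simply stops on every sasl_decode call and uses gdb's "finish" to
print the return value, rather than a true conditional breakpoint,
since gdb cannot condition directly on a return value):

```shell
# Sketch: gdb script to observe the return value of sasl_decode
# (requires matching cyrus-sasl debuginfo so the symbol resolves).
cat > /tmp/sasl-decode.gdb <<'EOF'
set pagination off
# Break on every sasl_decode call; after each stop, "finish" prints
# "Value returned is ..." -- the code we are after.
break sasl_decode
EOF

# Attach interactively (disruptive; the server is paused while stopped):
#   gdb -x /tmp/sasl-decode.gdb -p $(pidof ns-slapd)
#   (gdb) continue      # run until sasl_decode is hit
#   (gdb) finish        # note the printed return value
#   (gdb) detach
echo "gdb script written to /tmp/sasl-decode.gdb"
```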
If I understand the mechanism correctly, Server1 needs to have
debug versions of the relevant packages (probably 389-ds-base and
cyrus-sasl) installed in order to track down the problem.
Unfortunately, my Server1 is in production use - if I break it, my
colleagues will grab forks and torches and be after me. A short
downtime would be ok, though.
Is there something else I could do?
Sure, I do not want to trigger so much trouble ;-)
I think my email was not clear. To go further we would need to know
the exact reason why sasl_decode fails. I see two options:
* Prepare a debug version that reports in the error logs the
return value of sasl_decode (when it fails). Apart from the downtime
to install the debug version, it has no impact on production.
* Do a debug session (gdb) on Server1. The debug session would
install a breakpoint at a specific place, let the server run,
catch the sasl_decode failure, note the return code, and exit from
the debugger.
When the problem occurs, it happens regularly (every 5 seconds), so
we should not have to wait long.
That means debugging Server1 should disturb production for only 5
to 10 minutes.
A detailed procedure to do the debug session is required.
For starters:
http://www.port389.org/docs/389ds/FAQ/faq.html#debugging-crashes
Since this is IPA you will need debuginfo packages for ipa and
slapi-nis in addition to the ones for 389.
Take a look at the Debugging Hangs section where it describes how to
use gdb to get a stack trace. If you can use that gdb command to get
a stack trace with the full debugging symbols (and if you don't know
what that means, just post the redacted stack trace somewhere and
provide us with a link to it), then you should be all ready to do a
gdb session to reproduce the error and "catch it in the act".
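The non-interactive stack-trace step described there boils down to
something like the following (a sketch; the binary path assumes a
stock CentOS 6 install, and the debuginfo package names may vary):

```shell
# Sketch: script to capture a one-shot full stack trace of a running
# ns-slapd. Install debuginfo first so symbols resolve, e.g.:
#   debuginfo-install 389-ds-base cyrus-sasl ipa-server slapi-nis
cat > /tmp/stacktrace.sh <<'EOF'
#!/bin/sh
pid=$(pidof ns-slapd)
gdb -ex 'set confirm off' -ex 'set pagination off' \
    -ex 'thread apply all bt full' -ex 'quit' \
    /usr/sbin/ns-slapd "$pid" > /tmp/ns-slapd-stacktrace.txt 2>&1
EOF
chmod +x /tmp/stacktrace.sh

# Run /tmp/stacktrace.sh on the affected server, then redact hostnames
# and DNs in /tmp/ns-slapd-stacktrace.txt before posting a link to it.
```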
Mit freundlichen Gruessen/With best regards,
--Daniel.
--
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go To http://freeipa.org for more info on the project