On 07.10.2015 at 17:30, thierry bordaz wrote:
On 10/07/2015 05:03 PM, Dominik Korittki wrote:


On 07.10.2015 at 15:25, thierry bordaz wrote:
On 10/07/2015 11:19 AM, Martin Kosek wrote:
On 10/05/2015 02:13 PM, Dominik Korittki wrote:

On 01.10.2015 at 21:52, Rob Crittenden wrote:
Dominik Korittki wrote:
Hello folks,

I am running two FreeIPA servers with around 100 users and around 15,000 hosts, which are used by users to log in via SSH. The FreeIPA servers (which are CentOS 7.0) ran well for a while, but as more and more hosts got migrated to serve as FreeIPA hosts, it started to get slow and unstable.

For example, it's hard to maintain hostgroups which have more than 1,000 hosts. The ipa host-* commands are getting slower as the hostgroup grows. Is this normal?
You mean the ipa hostgroup-* commands? Whenever the entry is displayed (show and add) it needs to dereference all members, so yes, it is understandable that it gets somewhat slower with more members. How slow are we talking about?
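To put a number on "how slow", one rough approach (a sketch; the hostgroup name "biggroup" is a hypothetical example) is to compare the dereferencing command against a raw LDAP read of the same entry:

```shell
# hostgroup-show dereferences every member, so its runtime should grow
# with member count. "biggroup" is a placeholder hostgroup name.
time ipa hostgroup-show biggroup --all >/dev/null

# Baseline: fetch the same entry directly, without member dereferencing.
time ldapsearch -Y GSSAPI \
    -b "cn=biggroup,cn=hostgroups,cn=accounts,dc=internal" >/dev/null
```

A large gap between the two timings would point at the dereferencing step rather than the server itself.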

We also experience random dirsrv segfaults. Here's a dmesg line from the latest:

[690787.647261] traps: ns-slapd[5217] general protection ip:7f8d6b6d6bc1 sp:7f8d3aff2a88 error:0 in libc-2.17.so[7f8d6b650000+1b6000]
You probably want to start here:
http://www.port389.org/docs/389ds/FAQ/faq.html#debugging-crashes
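For reference, the procedure in that FAQ essentially comes down to installing debug symbols, letting the crash produce a core, and extracting a full backtrace; a rough sketch (package and instance names assumed from this thread):

```shell
# Install debugging symbols for the directory server (CentOS package name).
debuginfo-install 389-ds-base

# Allow core dumps. Note: for a systemd-managed instance the limit may
# need to be set via the service/sysconfig configuration rather than the
# interactive shell.
ulimit -c unlimited
systemctl restart dirsrv@INTERNAL.service

# After a crash, extract a full backtrace from the core file for the
# mailing list or a 389-DS ticket ("/path/to/core" is a placeholder).
gdb /usr/sbin/ns-slapd /path/to/core \
    -ex 'thread apply all bt full' -ex quit > stacktrace.txt
```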
A stacktrace from the latest crash is attached to this email. After restarting the service, this is what I get in /var/log/dirsrv/slapd-INTERNAL/errors (hostname is ipa01.internal):
Ludwig or Thierry, can you please take a look at the stack and file a 389-DS ticket if appropriate?

Hello Dominik,

DS is crashing during a BIND, and from the argument values we can guess it was due to a heap corruption that corrupted its operation pblock. This bind operation was likely a victim of the heap corruption rather than the cause of it.

Using valgrind is the best way to track down such a problem, but as you already suffer from bad performance I doubt it would be acceptable. How frequently does it crash? Did you identify any kind of test case?

At first the crashes happened on a daily basis. Simply restarting the dirsrv daemon resolved the issue for another day, but later on the daemon did not survive more than 15 minutes most of the time. There were exceptions, though: sometimes the daemon ran for several hours until it crashed.
I did not really identify a test case. However, I suspect it could have something to do with replication, as I have seen replication-related errors in the dirsrv error log (mentioned in an earlier mail in this topic).
Heap corruptions are usually dynamic, and if the server became slower and slower, that could change the dynamics in favor of heap corruption.

So I did the following:
ipa01 has a replication agreement with ipa02, and ipa01 was the one with segfaults. I removed ipa01 from the replication agreement (ipa-replica-manage del), did an ipa-server-install --uninstall on ipa01, and created ipa01 as a replica of ipa02. Since then I have not experienced any crashes (for now).
Instead I'm having trouble rebuilding a clean replication agreement (old RUV entries are still in the database), but that's another story I will eventually post on the mailing list as a new topic.
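On the stale-RUV side, FreeIPA's own tooling usually covers this; a hedged sketch of the common cleanup (the replica ID used below is a placeholder example):

```shell
# List the replica IDs (RUVs) known to the topology; stale entries left
# over from the removed ipa01 instance show up in this output.
ipa-replica-manage list-ruv

# Clean one stale replica ID ("8" is a placeholder; use an ID from the
# list-ruv output that no longer corresponds to a live master).
ipa-replica-manage clean-ruv 8
```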

As for valgrind: I've never used it before. Is there a handy explanation of how to use it in combination with 389-ds? If I still experience those crashes and manage to get it working, I could try it out.
You may follow this procedure
http://www.port389.org/docs/389ds/FAQ/faq.html#debugging-memory-growthinvalid-access-with-valgrind
(but remove --leak-check=yes because this is not a leak issue)
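For reference, that procedure boils down to stopping the instance and launching ns-slapd by hand under valgrind; a rough sketch (instance name INTERNAL taken from the logs in this thread, other paths assumed):

```shell
# Stop the normally managed instance first.
systemctl stop dirsrv@INTERNAL.service

# Run ns-slapd directly under valgrind's memcheck tool. Without
# --leak-check=yes, only invalid reads/writes are reported, which is
# what matters for a heap-corruption hunt. %p in the log file name
# expands to the PID of each traced process.
valgrind --tool=memcheck --trace-children=yes \
    --num-callers=40 --log-file=/var/tmp/slapd.vg.%p \
    /usr/sbin/ns-slapd -D /etc/dirsrv/slapd-INTERNAL \
    -i /var/run/dirsrv/slapd-INTERNAL.pid
```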

thanks
thierry

I experienced segmentation faults again on host ipa01, even after I rebuilt the replication topology as described in my previous mail. I followed your advice and ran valgrind last evening. Sadly, I forgot to remove --leak-check=yes, but I hope the information is still useful to you. If not, I'll do it again without --leak-check=yes.

Running under valgrind, the ns-slapd process needed quite some time until it opened its ports. You can see this by watching the error logs:

[20/Oct/2015:22:27:41 +0200] - 389-Directory/1.3.1.6 B2014.219.1825 starting up
[20/Oct/2015:22:27:42 +0200] - WARNING: userRoot: entry cache size 10485760B is less than db size 142483456B; We recommend to increase the entry cache size nsslapd-cachememsize.
[20/Oct/2015:22:27:44 +0200] schema-compat-plugin - warning: no entries set up under cn=computers, cn=compat,dc=internal
[20/Oct/2015:23:09:16 +0200] - slapd started. Listening on All Interfaces port 389 for LDAP requests
[20/Oct/2015:23:09:16 +0200] - Listening on All Interfaces port 636 for LDAPS requests
[20/Oct/2015:23:09:16 +0200] - Listening on /var/run/slapd-INTERNAL.socket for LDAPI requests
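The entry-cache WARNING in that startup log is worth fixing independently of the crash hunt: the log suggests raising nsslapd-cachememsize above the on-disk database size. A sketch (the size value is an example of roughly 200 MB; backend name userRoot comes from the warning itself):

```shell
# Raise the userRoot entry cache above the ~142 MB database size the
# warning reports, then restart the instance for it to take effect.
ldapmodify -x -D "cn=Directory Manager" -W <<'EOF'
dn: cn=userRoot,cn=ldbm database,cn=plugins,cn=config
changetype: modify
replace: nsslapd-cachememsize
nsslapd-cachememsize: 209715200
EOF
systemctl restart dirsrv@INTERNAL.service
```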

I guess that's normal, since running the process through valgrind incurs a huge performance penalty? The daemon crashed about 25 seconds after it had opened its ports. Here is the valgrind log:
http://pastebin.com/8t9RtB6p

Do you see any suspicious things? Many thanks for your help!


- Dominik



Kind regards,
Dominik Korittki


thanks
thierry
[05/Oct/2015:13:51:30 +0200] - slapd started.  Listening on All Interfaces port 389 for LDAP requests
[05/Oct/2015:13:51:30 +0200] - Listening on All Interfaces port 636 for LDAPS requests
[05/Oct/2015:13:51:30 +0200] - Listening on /var/run/slapd-INTERNAL.socket for LDAPI requests
[05/Oct/2015:13:51:30 +0200] slapd_ldap_sasl_interactive_bind - Error: could not perform interactive bind for id [] mech [GSSAPI]: LDAP error -2 (Local error) (SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (No Kerberos credentials available)) errno 0 (Success)
[05/Oct/2015:13:51:30 +0200] slapi_ldap_bind - Error: could not perform interactive bind for id [] authentication mechanism [GSSAPI]: error -2 (Local error)
[05/Oct/2015:13:51:30 +0200] NSMMReplicationPlugin - agmt="cn=meToipa02.internal" (ipa02:389): Replication bind with GSSAPI auth failed: LDAP error -2 (Local error) (SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (No Kerberos credentials available))
[05/Oct/2015:13:51:30 +0200] NSMMReplicationPlugin - changelog program - agmt="cn=masterAgreement1-ipa02.internal-pki-tomcat" (ipa02:389): CSN 54bea480000000600000 not found, we aren't as up to date, or we purged
[05/Oct/2015:13:51:30 +0200] NSMMReplicationPlugin - agmt="cn=masterAgreement1-ipa02.internal-pki-tomcat" (ipa02:389): Data required to update replica has been purged. The replica must be reinitialized.
[05/Oct/2015:13:51:30 +0200] NSMMReplicationPlugin - agmt="cn=masterAgreement1-ipa02.internal-pki-tomcat" (ipa02:389): Incremental update failed and requires administrator action
[05/Oct/2015:13:51:33 +0200] NSMMReplicationPlugin - agmt="cn=meToipa02.internal" (ipa02:389): Replication bind with GSSAPI auth resumed


These lines have been present since I replayed an LDIF dump from ipa02 to ipa01, but I didn't think they were related to the segfault problem (which is why I said there are no related problems in the logfile).

But I am starting to believe that these errors could be related to each other.


Kind regards,
Dominik Korittki



Nothing in /var/log/dirsrv/slapd-INTERNAL/errors relates to the problem.
Not sure about that anymore.

I'm thinking about migrating to the latest CentOS 7 FreeIPA 4, but would that solve my problems?

FreeIPA server version is 3.3.3-28.el7.centos
389-ds-base.x86_64 is 1.3.1.6-26.el7_0



Kind regards,
Dominik Korittki


--
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go to http://freeipa.org for more info on the project
