Re: [Freeipa-users] Replication woes

Rob Crittenden Tue, 27 Aug 2013 07:27:44 -0700

Bret Wortman wrote:

Here's a bit more about what I'm seeing today.


My master _is_ serving some DNS, but it appears that it's only serving
those zones that it knew about before all this trouble started 7-10 days
ago. In particular, it can only do reverse DNS on one zone (its own),
but can't serve reverse DNS for any other zones, even those that are in
its database and visible (and enabled) from the web UI.

# nslookup 4.3.2.1 ipamaster
;; connection timed out; no servers could be reached
# nslookup 6.5.2.1 ipamaster
Server:            ipamaster
Address:          10.9.2.1

1.2.5.6.in-addr.arpa     name = host1.foo.com.
#
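
The reverse lookups above end up as PTR queries under in-addr.arpa, so it helps to know which reverse name (and therefore which reverse zone) an address maps to. A minimal sketch, assuming IPv4 only; the function name `reverse_name` is hypothetical and not part of IPA or BIND:

```shell
# Turn a dotted-quad IPv4 address into its in-addr.arpa PTR name.
# Returns non-zero if the argument does not have exactly four fields.
reverse_name() {
    printf '%s\n' "$1" | awk -F. '
        NF == 4 {
            printf "%s.%s.%s.%s.in-addr.arpa\n", $4, $3, $2, $1
            ok = 1
        }
        END { exit !ok }
    '
}
```

For example, `reverse_name 6.5.2.1` yields the name seen in the working lookup above. The same name can then be queried directly, e.g. `dig +short -x 6.5.2.1 @ipamaster`, to see whether the master answers for that zone.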

Is this something that's easily rectified? The logs aren't giving me
anything obviously wrong -- nothing in /var/log/dirsrv/slapd-FOO-COM/errors
seems significant; just the same CLEANALLRUV errors I've been seeing for
the past week.

You might try restarting named. At a minimum it is going to log all the
zones it manages, so you can compare what it thinks it has vs. what IPA
has. A pattern might emerge.
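
A minimal sketch of that comparison, assuming you have saved two plain-text lists by hand, one zone name per line: the zones IPA holds (for example copied out of `ipa dnszone-find` output) and the zones named reported in its log after the restart. The function names and file names here are hypothetical:

```shell
# Normalize a zone list: lower-case, strip trailing whitespace and the
# trailing dot, drop empty lines, sort and de-duplicate.
_norm_zones() {
    tr 'A-Z' 'a-z' < "$1" |
        sed -e 's/[ 	]*$//' -e 's/\.$//' -e '/^$/d' |
        LC_ALL=C sort -u
}

# Print zones present in the first list (IPA) but absent from the
# second list (what named says it loaded).
zones_missing() {
    tmp_a=$(mktemp) || return 1
    tmp_b=$(mktemp) || { rm -f "$tmp_a"; return 1; }
    _norm_zones "$1" > "$tmp_a"
    _norm_zones "$2" > "$tmp_b"
    LC_ALL=C comm -23 "$tmp_a" "$tmp_b"
    rm -f "$tmp_a" "$tmp_b"
}
```

Running `zones_missing ipa-zones.txt named-zones.txt` lists the zones IPA knows about that named never mentioned, which is where to look first.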

You can delete an existing replica and re-create it, but with the 389-ds
errors I'm not sure what the repercussions would be, if any. You could
end up with more dead replicas. It could be that all the RUV you have
are because the deletions were done prior to 389-ds adding support for
CLEANALLRUV (and it getting into IPA).

rob




*Bret Wortman*

http://damascusgrp.com/
http://about.me/wortmanbret


On Tue, Aug 27, 2013 at 7:24 AM, Bret Wortman
<bret.wort...@damascusgrp.com> wrote:

    I managed to gather some data for Rich and others to review and
    updated a bug for them about a week ago. Now I am getting a lot of
    internal pressure to resolve our problems and get our infrastructure
    stable again. As of yesterday, our master IPA server would accept
    changes to DNS but isn't actually serving DNS, nor is it pushing
    data to any replicas. The replicas are acting as DNS servers but
    aren't getting any updates, nor can updates be made locally on them.
    Fortunately, we aren't adding users very often, but if anyone's
    password expires soon, I'm worried that I'll have an account lockout
    situation.

    So I'll ask again -- can anyone see a way to preserve just the
    actual DNS and authentication data within IPA while dumping its
    other data (replication and so on), restart it cleanly, verify it's
    working in all respects, and set up the replicas from scratch again?
    I'm hearing rumblings about going back to passwd files and host
    tables (which is what we were doing until about 12 months ago when I
    brought IPA in) and I'd really rather not go back to the stone ages....

    Thanks!






    On Tue, Aug 20, 2013 at 11:15 AM, Bret Wortman
    <bret.wort...@damascusgrp.com> wrote:

        If I were going to attempt to restore to an old backup, what
        directories/files should I make sure to restore? I've got a
        backup script that tars up:

        /usr/share/ipa
        /usr/lib64/ipa
        /var/lib/ipa
        /var/lib/ipa-client
        /var/lib/dirsrv
        /etc

        Is that enough to "roll back" to a few days ago before I started
        down this path? I'm now seeing messages about having the max
        number of CleanAllRUV tasks (4) and not being able to enqueue
        any more. So I'm really stuck now and don't know how soon I can
        get the files requested over to Rich for analysis.




        On Tue, Aug 20, 2013 at 9:46 AM, Rich Megginson
        <rmegg...@redhat.com> wrote:

            On 08/20/2013 05:55 AM, Bret Wortman wrote:

            Okay, now I'm thinking I need to dump all my replicas and
            start them fresh. My /var/log/dirsrv/slapd-FOO-COM/errors
            is filled with messages like this:

            NSMMReplicationPlugin - changelog program -
            agmt="cn=meTogood1.foo.com" (good1:389): CSN
            520a49640000001d0000 not found, we aren't as up to date, or
            we purged
            agmt="cn=meTogood1.foo.com" (good1:389) - Can't locate CSN
            520a49640000001d0000 in the changelog (DB rc=-30988). The
            consumer may need to be reinitialized.
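
The CSN in that message can be taken apart to see which replica ID and point in time it refers to. A minimal sketch, assuming the usual 389-ds layout of 20 hex digits (8 for the timestamp in seconds since the epoch, 4 for the sequence number, 4 for the replica ID, 4 for the sub-sequence number); the function name `decode_csn` is hypothetical:

```shell
# Split a 20-hex-digit CSN into its four fields and print them in decimal.
# Assumed layout: 8 digits timestamp, 4 sequence, 4 replica ID, 4 sub-sequence.
decode_csn() {
    csn=$1
    [ "${#csn}" -eq 20 ] || { echo "expected 20 hex digits" >&2; return 1; }
    ts=$(printf '%s' "$csn" | cut -c1-8)
    seqno=$(printf '%s' "$csn" | cut -c9-12)
    rid=$(printf '%s' "$csn" | cut -c13-16)
    sub=$(printf '%s' "$csn" | cut -c17-20)
    printf 'timestamp=%d seq=%d rid=%d subseq=%d\n' \
        "$((0x$ts))" "$((0x$seqno))" "$((0x$rid))" "$((0x$sub))"
}
```

Under that assumption, `decode_csn 520a49640000001d0000` prints `timestamp=1376405860 seq=0 rid=29 subseq=0`, i.e. a change that originated on replica ID 29. With GNU date, `date -u -d @1376405860` turns the timestamp into a readable date (13 August 2013, UTC).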

            I assume the "consumer" is the replica, right? At present,
            I have two replicas known to my master that are simply
            gone. Another is there but they can't talk. Three more
            have good communication but I'm getting errors like these.
            Is there a good, clean way to just clobber all the
            replicas and start over without trashing the DNS and other
            identity data that is inside my master and which /is/
            working? Deleting them from the master hasn't been
            working; it tends to hang the master's DNS and other
            services until I Ctrl-C out and "ipactl restart" it.

            I'm afraid to venture out without a net here and make
            things worse....


            This looks like https://fedorahosted.org/389/ticket/47386

            We've never been able to reproduce this in a "controlled"
            environment.

            The original reporter has been able to get this to work in
            some cases by restarting ipa (ipactl restart).

            Before you do that, would you be able to provide some
            information for me?

            On the supplier and consumer:
            ldapsearch -xLLL -D "cn=directory manager" -W -b "dc=FOO,dc=COM" \
              '(&(objectclass=nstombstone)(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff))' \
              > ruv.ldif

            ldapsearch -xLLL -D "cn=directory manager" -W -b "cn=config" \
              '(objectclass=nsds5replicationagreement)' > agmt.ldif

            dbscan -f /var/lib/dirsrv/slapd-FOO-COM/cldb/*.db4 | head -200 > cldb.txt

            Be sure to obscure any sensitive data in ruv.ldif,
            agmt.ldif, and cldb.txt - you can either attach to
            https://fedorahosted.org/389/ticket/47386 or email to me
            directly.






            On Mon, Aug 19, 2013 at 2:21 PM, Bret Wortman
            <bret.wort...@damascusgrp.com> wrote:

                On my master (where this error is occurring), I've
                got, in /etc/hosts:

                127.0.0.1 localhost localhost.localdomain
                ::1      localhost localhost.localdomain
                1.2.3.4 ipamaster.foo.net ipamaster

                So that should be okay, right?

                # host ipamaster.foo.net
                ipamaster.foo.net has address 1.2.3.4
                # host ipamaster
                ipamaster.foo.net has address 1.2.3.4
                # host localhost
                localhost has address 127.0.0.1
                localhost has IPv6 address ::1
                #

                I checked the other system (the one I can't connect
                to) to be safe, and its /etc/hosts is similarly
                configured. It even has the master listed with its
                correct IP address.





                On Mon, Aug 19, 2013 at 2:02 PM, Simo Sorce
                <s...@redhat.com> wrote:

                    On Mon, 2013-08-19 at 13:51 -0400, Bret Wortman wrote:
                    > So, any idea how to fix the Kerberos problem?
                    >

                    If your server is trying to get a TGT for ldap/localhost,
                    it probably means your /etc/hosts file is broken and has
                    a line like this:

                    1.2.3.4 localhost my.real.name

                    When GSSAPI tries to resolve my.real.name it gets back
                    that 'localhost' is the canonical name, so it tries to
                    get a TGT with that name and fails.

                    If /etc/hosts is fine, then the DNS server may be
                    returning an IP address that later resolves to localhost
                    again.

                    To fix this, make sure that if your fully qualified name
                    is in /etc/hosts, it is on its own line, pointing at the
                    right IP address, with the FQDN first on the line. For
                    example:

                    This is OK:
                    1.2.3.4 server.full.name server

                    This is not:
                    1.2.3.4 server server.full.name
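
A rough way to check a hosts file against the two rules above, as a sketch only. It assumes "FQDN" can be approximated as "the first name on the line contains a dot", and the function name `check_hosts_order` is hypothetical:

```shell
# Flag hosts-file lines that put localhost on a non-loopback address,
# or that list a short name before the fully qualified one.
# Prints nothing when no problem is found.
check_hosts_order() {
    awk '
        /^[ \t]*(#|$)/ { next }      # skip comment-only and blank lines
        {
            sub(/#.*/, "")           # drop trailing comments
            ip = $1
            loopback = (ip ~ /^127\./ || ip == "::1")
            has_localhost = 0
            for (i = 2; i <= NF; i++)
                if ($i ~ /^localhost/) has_localhost = 1
            if (!loopback && has_localhost)
                printf "%s: localhost listed on a non-loopback address\n", ip
            if (!loopback && NF > 2 && $2 !~ /\./)
                printf "%s: short name %s comes before the FQDN\n", ip, $2
        }
    ' "$1"
}
```

Run it as `check_hosts_order /etc/hosts`. An empty result only means these two patterns were not found; it says nothing about what DNS returns for the same names.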

                    Simo.
                    >
                    >
                    > On Mon, Aug 19, 2013 at 12:19 PM, Bret Wortman
                    > <bret.wort...@damascusgrp.com> wrote:
                    >         ...and I got the web UI, authentication and sudo
                    >         back via:
                    >
                    >         # ipactl stop
                    >         # ipactl start
                    >
                    >         Not sure why that worked, but it did. I was
                    >         grasping at straws, honestly.
                    >
                    >
                    >         On Mon, Aug 19, 2013 at 12:18 PM, Bret Wortman
                    >         <bret.wort...@damascusgrp.com> wrote:
                    >                 Digging further, I think this log entry
                    >                 might be the problem between the two
                    >                 servers that aren't talking:
                    >
                    >                 slapd_ldap_sasl_interactive_bind - Error:
                    >                 could not perform interactive bind for
                    >                 id[] mech [GSSAPI]: LDAP error -2 (Local
                    >                 error) (SASL(-1): generic failure: GSSAPI
                    >                 Error: Unspecified GSS failure. Minor code
                    >                 may provide more information (Server
                    >                 ldap/localh...@spx.net not found in
                    >                 Kerberos database)) errno 2 (No such file
                    >                 or directory)
                    >
                    >                 Did I build something incorrectly when
                    >                 that server was set up originally?
                    >
                    >
                    >                 On Mon, Aug 19, 2013 at 12:02 PM, Bret Wortman
                    >                 <bret.wort...@damascusgrp.com> wrote:
                    >                         I ran it on a good master, against
                    >                         a bad one. As in, I ran this
                    >                         command on my master IPA node:
                    >
                    >                         # ipa-replica-manage del --force bad1.foo.net --cleanup
                    >
                    >                         Was that wrong? I was trying to
                    >                         delete the bad replica from the
                    >                         master, so I figured the command
                    >                         needed to be run on the master. But
                    >                         again, my master is now in a state
                    >                         where it's not resolving DNS, user
                    >                         logins, or sudo at the very least.
                    >
                    >                         Oh, and I checked the node that it
                    >                         was complaining about earlier. The
                    >                         network connection to it is the
                    >                         pits, but it's there. And it
                    >                         resolves.
                    >
                    >
                    >                         On Mon, Aug 19, 2013 at 11:58 AM, Rob Crittenden
                    >                         <rcrit...@redhat.com> wrote:
                    > Rob Crittenden wrote:
                    >       Bret Wortman wrote:
                    >               Well, my master ground to a halt and wasn't
                    >               responding. I rebooted the system and now I
                    >               can't access the web UI or ssh to the master
                    >               either. I have console access but that's it.
                    >
                    >               The services all say they're running, but the
                    >               web UI gives an "Unknown Error" dialog and ssh
                    >               fails with "ssh_exchange_identification:
                    >               Connection closed by remote host" whenever I
                    >               try to ssh to ipamaster. I think something has
                    >               gone really wrong inside my master. Any ideas?
                    >               Even after the reboot, --cleanup isn't helping
                    >               and just hangs.
                    >
                    >               The logfiles end (as of the time I ^C'd the
                    >               process) with:
                    >
                    >               NSMMReplicationPlugin -
                    >               agmt="cn=meTogood3.spx.net" (good3:389):
                    >               Replication bind with GSSAPI auth failed: LDAP
                    >               error -2 (Local error) (SASL(-1): generic
                    >               failure: GSSAPI Error: Unspecified GSS
                    >               failure. Minor code may provide more
                    >               information (Cannot determine realm for
                    >               numeric host address))
                    >               NSMMReplicationPlugin - CleanAllRUV Task:
                    >               Replica not online
                    >               (agmt="cn=meTogood3.foo.net" (good3:389))
                    >               NSMMReplicationPlugin - CleanAllRUV Task: Not
                    >               all replicas online, retrying in 160
                    >               seconds...,
                    >
                    >               So it looks like it's having trouble talking
                    >               with one of my replicas and is doggedly trying
                    >               to get the job done. Any idea how to get the
                    >               master back working again while I troubleshoot
                    >               this connectivity issue?
                    >
                    >       That suggests a DNS problem, and it might explain ssh
                    >       as well depending on your configuration.
                    >
                    > To be clear, you ran --cleanup against one of the bad
                    > masters, not a good one, right?
                    >
                    > rob
                    >
                    > _______________________________________________
                    > Freeipa-users mailing list
                    > Freeipa-users@redhat.com
                    > https://www.redhat.com/mailman/listinfo/freeipa-users


                    --
                    Simo Sorce * Red Hat, Inc * New York






