Joachim Keinert writes:
> init_kaprocs: ka_LookupKey (code = 180484) DB not initialized properly?

translate_et can be used to translate Transarc-style error codes.
According to it, 180484 is (KA).4 = user doesn't exist, or KANOENT.
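
A minimal example, assuming the translate_et utility shipped with your
AFS binaries is on your path (the exact output format may vary by release):

        # translate a Transarc/AFS com_err code to its symbolic meaning
        translate_et 180484
        # should report something along the lines of (ka).4, i.e. KANOENT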

The error message "ka_LookupKey (code=%d) DB not initialized properly?"
is produced when kaserver has tried to initialize the database,
fetched the key for AuthServer.Admin@cell, and got an error.
KANOENT would certainly suggest that the database has somehow gotten
mongo trashed.

> What can be done if we get the above message upon the following command:
> bos getlog -server foxtrott -file AuthLog
>
> We have 5 servers. On the server in question (foxtrott) the unix date of the
> file kaserver.DB0 is the restart date for the server processes. On the other
> machines the date is o.k.. Also, the cksum is different to the other servers.
> The kaserver.DBSYS1 seems to be uptodate.

DB0 is the data file, while DBSYS1 is a transaction log used for
commit/abort processing.  It is possible for the DB0 file to be
valid even if the checksum doesn't match: there is a magic header
of 64 bytes at the start of the file, and there can be additional "slop" at
the end of the file which won't necessarily match, and thus will throw
the checksum calculation off even if the data contents are logically
the same.  In addition, the 64-byte header at the start of the file
contains the database version -- epoch.version -- and thus is likely
to change with each restart even if the data is not changed.
Also, the keys for AuthServer.Admin@REALM and krbtkt.REALM@REALM
are automatically changed by the server at fairly frequent intervals.
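
If you want to compare the data portion of the DB0 files across servers
while ignoring the header, something like the following sketch may help
(/usr/afs/db is the usual Transarc location -- adjust for your site, and
remember that the trailing "slop" and the periodic key changes mean the
sums can still legitimately differ):

        # checksum kaserver.DB0 with the 64-byte version header skipped
        dd if=/usr/afs/db/kaserver.DB0 bs=64 skip=1 2>/dev/null | cksum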

> Can I simply copy the file from another server and restart the kaserver?

It is preferable to just delete the files and let ubik fetch a
fresh copy.  If you copy the file by hand from a *live* machine, you risk
the possibility of a change occurring in the "middle" of the fetch,
resulting in possibly inconsistent data.  This will *probably*
just result in the database version being wrong, and hence
cause ubik to fetch the file anyway, but it's *possible* worse
could happen.  If you can manage to copy both files from a healthy
server to the sick one while the database is quiescent, that will
work, and may be preferable in certain cases because it avoids
the "temporary hang while the database is being fetched
upon restart" problem: ubik copies over the database and
locks up both the sync site and the slave site until this is
complete.
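
For the "delete and let ubik refetch" route, the sequence is roughly as
follows -- a sketch only, assuming the standard /usr/afs/db location and
that you're authenticated to bos (check instance names with "bos status"
first):

        # on the sick server, stop the kaserver instance
        bos shutdown foxtrott kaserver -wait
        # move the database files aside (this also preserves them)
        mv /usr/afs/db/kaserver.DB0    /usr/afs/db/kaserver.DB0.bad
        mv /usr/afs/db/kaserver.DBSYS1 /usr/afs/db/kaserver.DBSYS1.bad
        # start it again; ubik fetches a fresh copy from the sync site
        bos start foxtrott kaserver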

Before you restart it, it is worth saving a copy of the corrupt
file for offline analysis (especially if your Transarc support
representative asks for it).  It would be worth saving a non-corrupt
copy at the same time.  Keep in mind that the files contain keys for all
your users and administrative instances, and should be treated with hyper
paranoia (i.e., don't copy them out to a publicly readable AFS
filespace, and assume that anyone who has ever seen any part of
those files automatically has admin access to your cell).
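
A sketch of stashing those copies with suitably paranoid permissions
(local disk only, readable by root alone; the paths are illustrative):

        # never put these in AFS space
        mkdir -p /root/kadb-save && chmod 700 /root/kadb-save
        cp -p /usr/afs/db/kaserver.DB0 /root/kadb-save/kaserver.DB0.foxtrott.bad
        chmod 600 /root/kadb-save/*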

The other thing worth doing before restarting is to fetch
and save rx and ubik information for all your kaservers.
"rxdebug server 7004" and "udebug server 7004" for each
server will accomplish this.  Look for dead connections
between the servers with rxdebug, and for many connections
from somewhere to a particular machine.  Using udebug,
check database version numbers, and sync site and non-sync site
status.
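
A small loop to capture that state for later comparison might look like
this (the server names are placeholders for your five db servers):

        # snapshot rx and ubik state of every kaserver (port 7004)
        for s in foxtrott server2 server3 server4 server5; do
                rxdebug $s 7004 > rxdebug.$s 2>&1
                udebug  $s 7004 > udebug.$s 2>&1
        done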

It's probable that a restart will fix it all, and you won't
have to do anything further.  However, if the problem repeats, you
may have encountered a bug in kaserver that can bolix up kaserver
database restarts if:
        you have a very large database
        you have many read requests
        you have some fairly constant stream of writes happening
        some major event such as a network partition or server restart
                results in breaking ubik sync.
I know we encountered a problem like this some time back, and I can't
remember if we managed to do a good job coordinating passing our fix
back to transarc.com.

Some quick hacks you might try:
        reduce the number of db servers
        make sure there's enough free disk available for the
                afs databases.  There should, in general, be at least as
                much free space as the size of the largest
                database managed by ubik -- and as much space as
                all the databases combined is even better.  The
                backup database is normally the largest database managed
                by ubik.  (See the sketch after this list.)
        increase the speed of the db servers
        increase the speed/performance of the network connections
                between the db servers
        temporarily remove the source of the writes during the restart
        temporarily isolate the database servers during the restart
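
For the disk space check mentioned above, comparing free space on the
database partition against the database sizes is quick (again assuming
the usual /usr/afs/db location):

        # database sizes vs. free space on that partition
        ls -l /usr/afs/db
        df -k /usr/afs/db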

                                -Marcus Watts
                                UM ITD PD&D Umich Systems Group
