Robert Levas commented on AMBARI-23866:



Another thing we noticed is that if the test via kinit fails, the clean-up 
(Destroy the principal) does not happen.


This is done on purpose to help with the replicated KDC issue.  If we fail, we 
do not want to delete the identity since we might have failed because the 
replication has not happened yet.  So the retry button will retry the test 
using the previously created identity. 


> Kerberos Service Check failure due to kinit failure on random node
> ------------------------------------------------------------------
>                 Key: AMBARI-23866
>                 URL: https://issues.apache.org/jira/browse/AMBARI-23866
>             Project: Ambari
>          Issue Type: Improvement
>    Affects Versions: 2.5.2
>         Environment: Multiple Kerberos Domain Controllers across multiple 
> data centers for single realm.
>            Reporter: David F. Quiroga
>            Assignee: David F. Quiroga
>            Priority: Minor
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
> We were seeing Kerberos Service checks failures in Ambari. Specifically it 
> would fail during the first run of the day, succeed on the second, then fail 
> on the next but succeed if run again and so forth.
> Reviewing the operation log, it showed kinit failure from random node(s)
>  {{kinit: Client XXXX not found in Kerberos database while getting initial 
> credentials}}
> Since AMBARI-9852
> {quote}The service check must perform the following steps:
>    1.Create a unique principal in the relevant KDC (server)
>    2.Test that the principal can be used to authenticate via kinit (agent)
>    3.Destroy the principal (server)
> {quote}
> Which is a very good check of services.
> So what is happening...
> In our environment we have multiple Kerberos Domain Controllers across 
> multiple data centers all providing the same realm.
> The creation of a unique principal occurs at a single KDC and is propagated 
> to the others.
> The agents were testing the principal at different KDC, i.e. before it had a 
> change to propagate. This is why the second service check would succeed.

This message was sent by Atlassian JIRA

Reply via email to