[jira] [Commented] (AMBARI-23866) Kerberos Service Check failure due to kinit failure on random node

David F. Quiroga (JIRA) Fri, 18 May 2018 12:46:27 -0700

    [ 
https://issues.apache.org/jira/browse/AMBARI-23866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16481110#comment-16481110
 ]


David F. Quiroga commented on AMBARI-23866:
-------------------------------------------

We have about 10-20 KDC servers at 3-4 locations across the US. 

Analysis determined that it took about 1-2 minutes for a new principal to reach 
all KDCs in our environment. Basically, I started the service check and then ran 
an LDAP search against each host (in a code loop) for the new principal. 
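
A minimal sketch of that kind of measurement loop, in Python. The per-host 
lookup is passed in as `has_principal`, a hypothetical callable standing in for 
the LDAP search (the actual query is not shown in this comment); the `timeout` 
and `poll_interval` values are illustrative assumptions as well.

```python
import time


def measure_propagation(kdc_hosts, has_principal, timeout=300.0,
                        poll_interval=5.0, clock=time.monotonic,
                        sleep=time.sleep):
    """Poll every KDC host until the new principal is visible on it.

    has_principal(host) is a caller-supplied (hypothetical) lookup, e.g. an
    LDAP search against that host. Returns a dict mapping each host to the
    seconds elapsed when the principal was first seen there, or None if it
    was not seen before the timeout.
    """
    start = clock()
    seen = {}
    pending = list(kdc_hosts)
    while pending:
        # Iterate over a copy so hosts can be removed while looping.
        for host in list(pending):
            if has_principal(host):
                seen[host] = clock() - start
                pending.remove(host)
        if not pending or clock() - start >= timeout:
            break
        sleep(poll_interval)
    # Hosts still pending never showed the principal within the timeout.
    for host in pending:
        seen[host] = None
    return seen
```

The largest value in the returned dict approximates the replication time for 
the slowest KDC, to within one polling interval.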

I selected the values based on that but would be open to changing them, in 
either direction. 

If replication is taking more than 150 seconds, I think feedback to the 
users (i.e. a failure) is fair, as that seems like an unhealthy system. 
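
For illustration, a bounded retry around kinit might look like the sketch 
below. The 150 second ceiling comes from this comment; the 15 second delay, 
the function names, and the keytab-based kinit call are assumptions, not the 
values or code in the pull request.

```python
import subprocess
import time


def kinit_once(principal, keytab):
    """Run kinit with a keytab; True on success (assumes MIT kinit on PATH)."""
    result = subprocess.run(["kinit", "-k", "-t", keytab, principal])
    return result.returncode == 0


def kinit_with_retry(run_kinit, max_wait=150.0, delay=15.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Call run_kinit() until it succeeds or max_wait seconds would pass.

    Returns (succeeded, attempts). A final failure is reported to the
    caller rather than hidden, since replication slower than max_wait
    suggests an unhealthy system.
    """
    deadline = clock() + max_wait
    attempts = 0
    while True:
        attempts += 1
        if run_kinit():
            return True, attempts
        # Stop if another delay would push us past the deadline.
        if clock() + delay > deadline:
            return False, attempts
        sleep(delay)


# Example call (principal and keytab path are placeholders):
# ok, tries = kinit_with_retry(
#     lambda: kinit_once("check-principal@EXAMPLE.COM", "/tmp/check.keytab"))
```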

 

> Kerberos Service Check failure due to kinit failure on random node
> ------------------------------------------------------------------
>
>                 Key: AMBARI-23866
>                 URL: https://issues.apache.org/jira/browse/AMBARI-23866
>             Project: Ambari
>          Issue Type: Improvement
>    Affects Versions: 2.5.2
>         Environment: Multiple Kerberos Domain Controllers across multiple 
> data centers for single realm.
>            Reporter: David F. Quiroga
>            Assignee: David F. Quiroga
>            Priority: Minor
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> We were seeing Kerberos Service Check failures in Ambari. Specifically, it 
> would fail during the first run of the day, succeed on the second, then fail 
> on the next but succeed if run again, and so forth.
> The operation log showed kinit failures from random node(s):
>  {{kinit: Client XXXX not found in Kerberos database while getting initial 
> credentials}}
> Since AMBARI-9852
> {quote}The service check must perform the following steps:
>    1. Create a unique principal in the relevant KDC (server)
>    2. Test that the principal can be used to authenticate via kinit (agent)
>    3. Destroy the principal (server)
> {quote}
> Which is a very good check of services.
> So what is happening...
> In our environment we have multiple Kerberos Domain Controllers across 
> multiple data centers all providing the same realm.
> The creation of a unique principal occurs at a single KDC and is propagated 
> to the others.
> The agents were testing the principal at different KDCs, i.e. before it had a 
> chance to propagate. This is why the second service check would succeed.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
