[jira] [Commented] (AMBARI-23866) Kerberos Service Check failure due to kinit failure on random node

David F. Quiroga (JIRA) Thu, 17 May 2018 09:18:12 -0700

    [ 
https://issues.apache.org/jira/browse/AMBARI-23866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16479304#comment-16479304
 ]


David F. Quiroga commented on AMBARI-23866:
-------------------------------------------

[~rlevas] thanks for the feedback.

An invalid password should result in {{Preauthentication failed while getting 
initial credentials}}. In this case we are seeing {{Client not found in 
Kerberos database}}, which indicates that the principal provided does not exist. 
So I am not sure whether this failure would trigger the use of the {{master-kdc}}.
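
To make the distinction concrete, here is a minimal sketch of how the two messages could be told apart from the kinit output. The function name and the category labels are hypothetical, not anything in Ambari, and the match strings are simply the two messages quoted above; other kinit versions may word them differently.

```python
# Substrings taken from the two kinit messages quoted in this comment.
# Assumption: the kinit in use words its errors this way.
PREAUTH_FAILED = "Preauthentication failed"
CLIENT_NOT_FOUND = "not found in Kerberos database"


def classify_kinit_error(stderr):
    """Map kinit stderr text to a coarse failure category (hypothetical labels)."""
    if CLIENT_NOT_FOUND in stderr:
        # The KDC that answered does not know the principal,
        # e.g. because it has not been replicated there yet.
        return "principal_missing"
    if PREAUTH_FAILED in stderr:
        # The principal exists but the password or key is wrong.
        return "bad_credentials"
    return "other"
```

Only the first category looks like a candidate for a retry, since the second will not fix itself with time.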

Over the last year, new Active Directory Domain Controllers were deployed here 
at work and the old ones were retired. In the process we learned that 
{{kerberos-env\ldap_url}} had been set to a single AD server rather than the DNS 
name. Since then we have tried to avoid referencing a single AD server. 

RE: latency of the replication process. I like the retry because when the 
latency is small the service check does not have to wait the maximum time, i.e. 
most users are not affected by the addition of the retry. And true, we can't 
guarantee that we are waiting long enough for every environment, but if 
replication is taking more than 2 minutes it seems fair to alert on that. 
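
A rough sketch of the kind of bounded retry I have in mind, not the actual patch. The function name, the {{max_wait}} and {{first_delay}} parameters, and the doubling delay are all assumptions for illustration; {{run}} and {{sleep}} are injectable only so the logic can be exercised without a KDC.

```python
import subprocess
import time


def kinit_with_retry(principal, keytab, max_wait=120, first_delay=2,
                     run=subprocess.run, sleep=time.sleep):
    """Retry kinit while the principal is unknown, sleeping at most max_wait seconds."""
    delay = first_delay
    waited = 0
    while True:
        result = run(["kinit", "-kt", keytab, principal],
                     capture_output=True, text=True)
        if result.returncode == 0:
            return True  # fast path: no extra wait when replication is quick
        if "not found in Kerberos database" not in result.stderr:
            return False  # some other failure; waiting will not help
        if waited >= max_wait:
            return False  # replication took too long; fair to alert
        step = min(delay, max_wait - waited)
        sleep(step)
        waited += step
        delay *= 2  # back off so a slow environment is not hammered
```

With these defaults a check that succeeds on the first attempt pays nothing extra, and one that never succeeds gives up after 120 seconds of waiting.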

Another thing we noticed is that if the test via kinit fails, the clean-up 
(Destroy the principal) does not happen, meaning the principals are still out 
in AD and the keytabs are still on the clients. Re-running the service check on 
the same day will succeed and clean those up, but that is not an ideal process. 
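
One way to picture the clean-up always running is the sketch below. It is only an illustration of the ordering, not how Ambari stages the commands; the three callables are hypothetical stand-ins for the create, test and destroy steps.

```python
def run_service_check(create_principal, test_kinit, destroy_principal):
    """Run create -> test -> destroy, destroying even if the test fails or raises."""
    principal = create_principal()
    try:
        return test_kinit(principal)
    finally:
        # Runs whether the kinit test passed, failed, or raised,
        # so no test principal is left behind.
        destroy_principal(principal)
```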

> Kerberos Service Check failure due to kinit failure on random node
> ------------------------------------------------------------------
>
>                 Key: AMBARI-23866
>                 URL: https://issues.apache.org/jira/browse/AMBARI-23866
>             Project: Ambari
>          Issue Type: Improvement
>    Affects Versions: 2.5.2
>         Environment: Multiple Kerberos Domain Controllers across multiple 
> data centers for single realm.
>            Reporter: David F. Quiroga
>            Assignee: David F. Quiroga
>            Priority: Minor
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> We were seeing Kerberos Service Check failures in Ambari. Specifically, it 
> would fail during the first run of the day, succeed on the second, then fail 
> on the next but succeed if run again, and so forth.
> Reviewing the operation log, it showed kinit failure from random node(s)
>  {{kinit: Client XXXX not found in Kerberos database while getting initial 
> credentials}}
> Since AMBARI-9852
> {quote}The service check must perform the following steps:
>    1. Create a unique principal in the relevant KDC (server)
>    2. Test that the principal can be used to authenticate via kinit (agent)
>    3. Destroy the principal (server)
> {quote}
> Which is a very good check of services.
> So what is happening...
> In our environment we have multiple Kerberos Domain Controllers across 
> multiple data centers all providing the same realm.
> The creation of a unique principal occurs at a single KDC and is propagated 
> to the others.
> The agents were testing the principal at a different KDC, i.e. before it had a 
> chance to propagate. This is why the second service check would succeed.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
