[ https://issues.apache.org/jira/browse/AMBARI-23866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16481110#comment-16481110 ]
David F. Quiroga commented on AMBARI-23866:
-------------------------------------------

We have about 10-20 KDC servers at 3-4 locations across the US.
Analysis determined that it took about 1-2 minutes for a new principal to reach all KDCs in our environment. The method: start the service check, then run an LDAP search against each host (in a code loop) for the new principal.
I selected values based on that but would be open to changing them, in either direction.
If replication takes more than 150 seconds, I think feedback to the users (AKA failure) is fair, as that seems like an unhealthy system.

> Kerberos Service Check failure due to kinit failure on random node
> ------------------------------------------------------------------
>
>                 Key: AMBARI-23866
>                 URL: https://issues.apache.org/jira/browse/AMBARI-23866
>             Project: Ambari
>          Issue Type: Improvement
>    Affects Versions: 2.5.2
>         Environment: Multiple Kerberos Key Distribution Centers (KDCs) across
> multiple data centers for a single realm.
>            Reporter: David F. Quiroga
>            Assignee: David F. Quiroga
>            Priority: Minor
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> We were seeing Kerberos Service Check failures in Ambari. Specifically, it
> would fail during the first run of the day, succeed on the second, then fail
> on the next but succeed if run again, and so forth.
> Reviewing the operation log, it showed kinit failures from random node(s):
> {{kinit: Client XXXX not found in Kerberos database while getting initial
> credentials}}
> Since AMBARI-9852:
> {quote}The service check must perform the following steps:
> 1. Create a unique principal in the relevant KDC (server)
> 2. Test that the principal can be used to authenticate via kinit (agent)
> 3. Destroy the principal (server)
> {quote}
> Which is a very good check of services.
> So what is happening...
> In our environment we have multiple Kerberos Key Distribution Centers (KDCs)
> across multiple data centers, all providing the same realm.
> The creation of a unique principal occurs at a single KDC and is propagated
> to the others.
> The agents were testing the principal at different KDCs, i.e. before it had a
> chance to propagate. This is why the second service check would succeed.
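
The propagation race described above can be tolerated by retrying the kinit test for a bounded time before reporting failure. The following is only a minimal sketch of that idea, not the code from the pull request: the function names `kinit_succeeds` and `wait_for_principal` are hypothetical, the 15-second delay is an assumed value, and the 150-second ceiling is taken from the comment above.

```python
import subprocess
import time
from typing import Callable


def kinit_succeeds(principal: str, keytab: str) -> bool:
    """Return True if `kinit -kt <keytab> <principal>` exits with status 0.

    Hypothetical helper: assumes MIT Kerberos `kinit` is on the PATH.
    """
    result = subprocess.run(
        ["kinit", "-kt", keytab, principal],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def wait_for_principal(
    attempt: Callable[[], bool],
    max_wait_seconds: float = 150.0,  # ceiling mentioned in the comment above
    delay_seconds: float = 15.0,      # assumed value, not from the patch
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `attempt` until it succeeds or the wait budget is used up.

    Only the time spent sleeping between attempts counts against the budget.
    """
    waited = 0.0
    while True:
        if attempt():
            return True
        # Give up once another delay would exceed the budget.
        if waited + delay_seconds > max_wait_seconds:
            return False
        sleep(delay_seconds)
        waited += delay_seconds


# Example call (principal and keytab path are placeholders):
# ok = wait_for_principal(
#     lambda: kinit_succeeds("check@EXAMPLE.COM", "/tmp/check.keytab"))
```

Passing `attempt` and `sleep` as parameters keeps the retry loop independent of kinit, so it can be exercised without a KDC. A result of `False` corresponds to the case where replication took longer than the budget and the check should fail.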