-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42970/
-----------------------------------------------------------

Review request for Ambari, Alejandro Fernandez, Eugene Chekanskiy, and Nate 
Cole.


Bugs: AMBARI-14847
    https://issues.apache.org/jira/browse/AMBARI-14847


Repository: ambari


Description
-------

The alerts framework on each Ambari Agent executes alerts in a thread pool when their jobs trigger. This can cause the following error to appear at random and the alert to go CRITICAL:

{noformat}
 Connection failed to http://nat-rare-21-dvitiiuk-2-5.novalocal:8088 (Execution 
of '/usr/bin/kinit -l 5m -c 
/var/lib/ambari-agent/tmp/web_alert_cc_f3f99363c3b7d1667f1287ce3a35aa52 -kt 
/etc/security/keytabs/spnego.service.keytab 
HTTP/[email protected] > /dev/null' returned 1.

kinit: Internal credentials cache error while storing credentials while getting 
initial credentials)
{noformat}

The alerts would go CRITICAL at random as their tickets expired, only to return to OK shortly afterward.

The root cause is that the {{kinit}} command executed to create new credentials cannot be run concurrently for the same user; simultaneous invocations fail while storing credentials to the shared credentials cache.


Diffs
-----

  ambari-common/src/main/python/resource_management/core/global_lock.py 
PRE-CREATION 
  
ambari-common/src/main/python/resource_management/libraries/functions/curl_krb_request.py
 b42a8a3 
  
ambari-common/src/main/python/resource_management/libraries/functions/hive_check.py
 aacb176 
  
ambari-server/src/main/resources/common-services/HIVE/0.12.0.2.0/package/alerts/alert_hive_metastore.py
 dbf0600 
  
ambari-server/src/main/resources/common-services/HIVE/0.12.0.2.0/package/alerts/alert_webhcat_server.py
 1e95703 
  
ambari-server/src/main/resources/common-services/OOZIE/4.0.0.2.0/package/alerts/alert_check_oozie_server.py
 fcc2d49 
  ambari-server/src/test/python/TestGlobalLock.py PRE-CREATION 

Diff: https://reviews.apache.org/r/42970/diff/


Testing
-------

Deployed to a cluster experiencing the issue.

----------------------------------------------------------------------
Total run: 868
Total errors: 0
Total failures: 0
OK


Thanks,

Jonathan Hurley
