Andrew Onischuk created AMBARI-14017:
----------------------------------------
Summary: Service or component install fails when a non-ambari
apt-get command is running
Key: AMBARI-14017
URL: https://issues.apache.org/jira/browse/AMBARI-14017
Project: Ambari
Issue Type: Bug
Reporter: Andrew Onischuk
Assignee: Andrew Onischuk
Fix For: 2.1.3
PROBLEM
Customer Microsoft Research reports that they routinely run "apt-get check"
from a cron job on their servers to check for broken dependencies. The command
may take up to two minutes to complete on various nodes in their cluster, and
while it runs it locks the package database via a write lock on
/var/lib/dpkg/lock. If Ambari is asked during that interval to install a new
component, or to perform any other maintenance task on that node that requires
access to the package database, the command fails. Because the apt-get check
runs from cron, apparently fairly frequently, this is a problem for ongoing
maintenance, especially in large clusters.
It would be desirable if Ambari and/or the agent were more tolerant of
locks on the package database.
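For illustration, here is a minimal Python sketch of how a process could tell whether the package database is currently locked. It assumes /var/lib/dpkg/lock is guarded by a POSIX fcntl record lock. The helper name is_write_locked is hypothetical and is not part of Ambari.

```python
import errno
import fcntl
import os


def is_write_locked(path):
    """Return True if another process holds a conflicting fcntl lock on path.

    Caveat: when the file is free, this probe takes the lock itself for a
    moment before releasing it.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o640)
    try:
        try:
            # Non-blocking attempt: fails at once if someone else holds it.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            if exc.errno in (errno.EACCES, errno.EAGAIN):
                return True
            raise
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return False
    finally:
        # Closing the descriptor also drops any fcntl lock this process
        # holds on the file.
        os.close(fd)
```

A caller would need enough privilege to open the lock file for writing, which on a real node normally means running as root.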
The stack trace at failure follows:
Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/stacks/HDP/2.0.6/hooks/before-INSTALL/scripts/hook.py", line 37, in <module>
    BeforeInstallHook().execute()
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 219, in execute
    method(env)
  File "/var/lib/ambari-agent/cache/stacks/HDP/2.0.6/hooks/before-INSTALL/scripts/hook.py", line 33, in hook
    install_repos()
  File "/var/lib/ambari-agent/cache/stacks/HDP/2.0.6/hooks/before-INSTALL/scripts/repo_initialization.py", line 59, in install_repos
    _alter_repo("create", params.repo_info, template)
  File "/var/lib/ambari-agent/cache/stacks/HDP/2.0.6/hooks/before-INSTALL/scripts/repo_initialization.py", line 50, in _alter_repo
    components = ubuntu_components, # ubuntu specific
  File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", line 154, in __init__
    self.env.run()
  File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 152, in run
    self.run_action(resource, action)
  File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 118, in run_action
    provider_action()
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/repository.py", line 110, in action_create
    retcode, out = checked_call(update_cmd_formatted, sudo=True, quiet=False)
  File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 70, in inner
    result = function(command, **kwargs)
  File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 92, in checked_call
    tries=tries, try_sleep=try_sleep)
  File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 140, in _call_wrapper
    result = _call(command, **kwargs_copy)
  File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 291, in _call
    raise Fail(err_msg)
resource_management.core.exceptions.Fail: Execution of 'apt-get update -qq -o Dir::Etc::sourcelist=sources.list.d/HDP.list -o Dir::Etc::sourceparts=- -o APT::Get::List-Cleanup=0' returned 100. W: GPG error: http://public-repo-1.hortonworks.com HDP InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY B9733A7A07513CAD
E: Could not get lock /var/lib/dpkg/lock - open (11: Resource temporarily unavailable)
E: Unable to lock the administration directory (/var/lib/dpkg/), is another process using it?
BUSINESS IMPACT
MSFT Research will not manage their cluster with Ambari if this cannot be
fixed by the end of November.
EXPECTED
Ambari retries the installation for some period of time while the package
database is locked.
ACTUAL
Ambari fails
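As a rough sketch of the expected behaviour, and not the actual fix, a wrapper could retry a package command only when its output shows the lock was busy. The function name run_with_lock_retry and the LOCK_MARKERS tuple are hypothetical. The tries and try_sleep parameters borrow the keyword names visible in the shell.py frames of the trace, and the marker strings are copied from the apt-get errors in the trace.

```python
import subprocess
import time

# Substrings taken from the apt-get error output quoted in this issue.
LOCK_MARKERS = (
    "Could not get lock",
    "Unable to lock the administration directory",
)


def run_with_lock_retry(cmd, tries=5, try_sleep=10, sleep=time.sleep):
    """Run cmd; retry only when the failure looks like a busy dpkg lock.

    Returns the combined stdout/stderr text on success and raises
    RuntimeError on a non-lock failure or once all tries are used up.
    """
    for attempt in range(1, tries + 1):
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        if proc.returncode == 0:
            return proc.stdout
        lock_busy = any(marker in proc.stdout for marker in LOCK_MARKERS)
        if not lock_busy or attempt == tries:
            raise RuntimeError(
                "Execution of %r returned %d. %s"
                % (cmd, proc.returncode, proc.stdout)
            )
        # Another process holds the lock: wait, then try again.
        sleep(try_sleep)
```

With tries=5 and try_sleep=10 the wrapper waits at most 40 seconds in total, so covering a two-minute "apt-get check" would need larger values. Matching on message text is fragile, and a real implementation may prefer a different signal.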
SUPPORT ANALYSIS
I created a simple program based on the code at
<http://beej.us/guide/bgipc/output/html/multipage/flocking.html> that takes a
write lock on /var/lib/dpkg/lock on demand, and then attempted a component
install on a new node in a cluster. The install failed. After I released the
lock, the installation succeeded. This is easily reproduced using a simple C program on
a target node.
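The reproduction does not strictly need C. Below is a rough Python equivalent, assuming the lock is an fcntl record lock of the kind the linked guide describes. It is not the program used in the analysis above, and the name held_write_lock is hypothetical.

```python
import contextlib
import fcntl
import os


@contextlib.contextmanager
def held_write_lock(path):
    """Hold an exclusive fcntl lock on path for the duration of the block."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o640)
    try:
        # Blocks until the exclusive lock is granted.
        fcntl.lockf(fd, fcntl.LOCK_EX)
        yield
    finally:
        # Closing the descriptor releases the lock.
        os.close(fd)
```

Run as root on a target node, wrapping time.sleep(120) in "with held_write_lock('/var/lib/dpkg/lock'):" should keep the lock held for about as long as the customer's "apt-get check" does, which leaves time to trigger a component install from Ambari.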
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)