I've been having issues running Ansible against AIX, specifically with the
copy/template modules.
Periodically, copy/template plays will hang, either for a long time (hours,
as in leave it overnight and it might be completed the next day) or
indefinitely. After reviewing debug output for a number of these instances,
the issue appears to occur in the sh.py code under runner, in the
'checksum' function. Below is an example of the debug output at the point
where the copy/template module hangs:
<aix14.mgmt.loc> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o
ControlPersist=60s -o
ControlPath="/home/ansible/.ansible/cp/ansible-ssh-%h-%p-%r" -o Port=22 -o
KbdInteractiveAuthentication=no -o
PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey
-o PasswordAuthentication=no -o ConnectTimeout=10 aix14.mgmt.loc /bin/sh -c
'sudo -k && sudo -H -S -p "[sudo via ansible,
key=sfknwylttinwgjiawaunhugtrjbqdymg] password: " -u root /bin/sh -c
'"'"'echo SUDO-SUCCESS-sfknwylttinwgjiawaunhugtrjbqdymg; rc=flag; [ -r
"/etc/ntp.conf" ] || rc=2; [ -f "/etc/ntp.conf" ] || rc=1; [ -d
"/etc/ntp.conf" ] && rc=3; python -V 2>/dev/null || rc=4; [ x"$rc" !=
"xflag" ] && echo "${rc} /etc/ntp.conf" && exit 0; (python -c
'"'"'"'"'"'"'"'"'import hashlib; print(hashlib.sha1(open("/etc/ntp.conf",
"rb").read()).hexdigest())'"'"'"'"'"'"'"'"' 2>/dev/null) || (python -c
'"'"'"'"'"'"'"'"'import sha; print(sha.sha(open("/etc/ntp.conf",
"rb").read()).hexdigest())'"'"'"'"'"'"'"'"' 2>/dev/null) || (echo "0
/etc/ntp.conf")'"'"''
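For anyone trying to read the quoted one-liner above, here is a rough Python restatement of what the remote shell command does, as I understand it. This is only a sketch: the function name remote_checksum is made up for illustration, and the "python -V" check (rc=4) and the final 'echo "0 <path>"' fallback are left out because they only make sense in the shell version.

```python
import hashlib
import os


def remote_checksum(path):
    """Hypothetical helper mirroring the rc logic of the shell one-liner."""
    rc = "flag"
    if not os.access(path, os.R_OK):   # [ -r path ] || rc=2
        rc = "2"
    if not os.path.isfile(path):       # [ -f path ] || rc=1
        rc = "1"
    if os.path.isdir(path):            # [ -d path ] && rc=3
        rc = "3"
    if rc != "flag":
        # The shell version echoes "<rc> <path>" and exits 0 here.
        return "%s %s" % (rc, path)
    # Same hashing call as the first "python -c" attempt in the command.
    with open(path, "rb") as fh:
        return hashlib.sha1(fh.read()).hexdigest()
```

Running this by hand on the AIX host against the same file is one way to check whether the hashing step itself ever stalls outside of an Ansible-driven SSH session.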
This will happen during random copy/template plays, not necessarily for the
same file as in the example above. The issue is reproducible, but not
consistently; 1 in 5 runs or more may hit it. It appears that the file
actually copies over successfully, and then the session hangs. If I run
"who -u" on the AIX host and "kill <pid>" the PID of the SSH session, the
playbook will continue on. I can confirm this happens using SFTP, and with
"scp_if_ssh = True". It also happens with "pipelining = True" configured.
After digging about on the interwebs, I have found a handful of references
to issues with the version of Python included by IBM as part of the
Linux-for-AIX toolbox. The version we're using is
from http://www.perzl.org/aix/, which doesn't suffer the same issues
(see https://github.com/ansible/ansible-modules-core/issues/80). I tried
substituting 'hashlib.sha1' with 'hashlib._md5', and was able to reproduce
the same hanging issue. Following some references online from other folks
using Ansible to manage AIX, I've symlinked /bin/md5sum to /bin/csum; this
also did not fix our issues. I can also periodically reproduce the issue
when running a single ad-hoc ansible command using the copy module.
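Since the hang is intermittent, a small harness that repeats one command and counts how often it exceeds a time limit may help put a number on it. This is a minimal sketch, not something from Ansible itself: count_hangs is a hypothetical helper, and the timeout value is an arbitrary assumption.

```python
import subprocess


def count_hangs(cmd, runs=5, timeout=60.0):
    """Run cmd `runs` times; count the runs that exceed `timeout` seconds."""
    hangs = 0
    for _ in range(runs):
        try:
            # subprocess.run kills the child and raises TimeoutExpired
            # if the command has not finished within `timeout` seconds.
            subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            hangs += 1
    return hangs
```

For example, cmd could be an ad-hoc call such as ["ansible", "aix14.mgmt.loc", "-m", "copy", "-a", "src=/tmp/ntp.conf dest=/etc/ntp.conf"], where the src path is made up for illustration. Note that killing the local command on timeout would not necessarily clean up the stuck SSH session on the AIX side.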
Below is truss output from an AIX box where this issue occurs; this is a
truss against the ssh process of the user connected in from Ansible. I'm
by no means an expert at reading truss output, but it appears that
/bin/sh is called, then it forks off a subprocess, which exits right away
(hence the SIGCHLD), and then the process hangs at "close(8)
(sleeping...)". This is where it will hang for a very long time. The
PID that gets forked off (24379542 in the example below) ends up in a
'<defunct>' state.
kwrite(4, "\0\00304 / b i n / s h ".., 776) = 776
kfcntl(7, F_DUPFD, 0x00000000) = 9
kfcntl(7, F_DUPFD, 0x00000000) = 10
sigprocmask(0, 0xF02B4970, 0xF02B4978) = 0
kfork() = 24379542
thread_setmymask_fast(0x00000000, 0x00000000, 0x00000000, 0xD052A400,
0x00000000, 0x11960029, 0x00000000) = 0x00000000
Received signal #20, SIGCHLD [caught]
sigprocmask(2, 0xF02B4970, 0x2FF21E80) = 0
_sigaction(20, 0x00000000, 0x2FF21F30) = 0
thread_setmymask_fast(0x00080000, 0x00000000, 0x00000000, 0x11960029,
0x00000003, 0x00000000, 0x00000000) = 0x00000000
kwrite(6, "\0", 1) = 1
ksetcontext_sigreturn(0x2FF21FE0, 0x2FF22FF8, 0x2002D0D0, 0x0000D032,
0x00000003, 0x00000000, 0x00000000)
close(8) (sleeping...)
In the interest of full disclosure, I also notice weird behavior with the
'w' command when trying to determine if Ansible has an SSH session open on
a host where a playbook is hanging. The 'w' command will hang for a few
seconds when it hits the user logged in and running the Ansible playbook.
When I run a truss against the 'w' command, I get the output below. The
command blocks trying to open the user's pts until it is interrupted
(EINTR) by a SIGALRM, which apparently means the system call is taking too
long to respond; only then does it go on to get the status of the pts:
kopen("/dev/pts/4", O_RDONLY|O_NONBLOCK) (sleeping...)
kopen("/dev/pts/4", O_RDONLY|O_NONBLOCK) Err#4 EINTR
Received signal #14, SIGALRM [caught]
_sigaction(14, 0x0FFFFFFFFFFFEEB0, 0x0FFFFFFFFFFFEEE0) = 0
ksetcontext_sigreturn(0x0FFFFFFFFFFFF000, 0x0000000000000000,
0x0FFFFFFFFFFFFFE8, 0x800000000000D032, 0x3FFC000000000003,
0x00000000000000E8, 0x0000000000000000, 0x0000000000000000)
statx("/dev/pts/4", 0x0FFFFFFFFFFFF618, 176, 0) = 0
incinterval(0, 0x0FFFFFFFFFFFF4F8, 0x0FFFFFFFFFFFF518) = 0
statx("/dev/pts", 0x0FFFFFFFFFFFF618, 176, 0) = 0
statx("/dev/pts/4", 0x0FFFFFFFFFFFF638, 176, 0) = 0
ansible pts/4 03:15PM 36 0 0 -
kwrite(1, " a n s i b l e p t s".., 62) = 62
kread(3, "\0\0\0\0\0\0\0\0\0\0\0\0".., 4096) = 1136
_sigaction(14, 0x0FFFFFFFFFFFF4F0, 0x0FFFFFFFFFFFF520) = 0
incinterval(0, 0x0FFFFFFFFFFFF4F8, 0x0FFFFFFFFFFFF518) = 0
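My reading of the 'w' trace is that a timer is armed around a call that may block, and the SIGALRM is what breaks it out. The same pattern can be sketched in Python. This is an illustration of the general technique on a pipe, not a reconstruction of what 'w' does internally; read_with_alarm and Timeout are hypothetical names.

```python
import os
import signal


class Timeout(Exception):
    """Raised by the SIGALRM handler to abandon a blocked call."""


def _on_alarm(signum, frame):
    raise Timeout()


def read_with_alarm(fd, seconds):
    """Read one byte from fd, giving up after `seconds` if the read blocks."""
    old_handler = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)  # arm the timer
    try:
        return os.read(fd, 1)
    except Timeout:
        # The blocked read was interrupted by SIGALRM, like the EINTR above.
        return None
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)    # disarm the timer
        signal.signal(signal.SIGALRM, old_handler)
```

This only works on Unix and in the main thread, since that is where Python delivers signals.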
My environment is as follows:
Ubuntu 12.04
Ansible 1.8.2 (installed from the Ansible PPA)
AIX 7.1 (have reproduced for sure on TL2SP4, and TL1SP0)
python 2.7.5