I am getting some really hard to understand failures.

I have three managed machines, running Ubuntu 14.04.03 --- which normally 
has Linux kernel 3.13.  I run Ansible 1.9.4 on my Mac (OS 10.10.5) to 
manage them.  The managed machines are VMs in a cloud service, on a 
datacenter subnet, and I use an OpenVPN tunnel between them and my Mac.  I 
have a playbook, and it runs.  All is well.  BTW: I have no ansible.cfg 
anywhere (I set the SL_INVENTORY envar to point at the appropriate dynamic 
inventory program), and none of my SSH config files say anything about 
TCPKeepAlive or ServerAlive<whatever>.

Then on the managed machines I run (as root) `apt-get install 
linux-generic-lts-wily`, which upgrades the kernel to 4.2.  After that my 
Ansible is flaky.  When I run my playbook, usually one or two of the three 
managed machines fails the gather-facts; the remainder fail in the first 
few tasks.  Which fails where is variable, so I think this is due to some 
intermittent problem.

Thinking it might be a connectivity flakiness, I created `ansible.cfg` that 
specifies a timeout of 100, and set `TCPKeepAlive no` and 
`ServerAliveCountMax 100` in my SSH client config (leaving 
ServerAliveInterval unspecified everywhere, which I think disables that 
function).  These did not solve the problem.  The playbook runs to full 
failure in under 30 seconds.

The failure message shows a truncated reply from the remote machine and 
then some SSH debugging that ends with a complaint about a broken pipe.  
Here is an example from the gather-facts step (I passed `-vvvv` to 
`ansible-playbook`):

GATHERING FACTS 
*************************************************************** 
<10.140.155.166> ESTABLISH CONNECTION FOR USER: mspreitzer
<10.140.155.162> ESTABLISH CONNECTION FOR USER: mspreitzer
<10.140.155.166> REMOTE_MODULE setup
<10.140.155.161> ESTABLISH CONNECTION FOR USER: mspreitzer
<10.140.155.162> REMOTE_MODULE setup
<10.140.155.161> REMOTE_MODULE setup
<10.140.155.166> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o 
ControlPersist=60s -o 
ControlPath="/Users/mspreitz/.ansible/cp/ansible-ssh-%h-%p-%r" -o 
KbdInteractiveAuthentication=no -o 
PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey 
-o PasswordAuthentication=no -o User=mspreitzer -o ConnectTimeout=100 
10.140.155.166 /bin/sh -c 'mkdir -p 
$HOME/.ansible/tmp/ansible-tmp-1454784058.99-138766034918573 && chmod a+rx 
$HOME/.ansible/tmp/ansible-tmp-1454784058.99-138766034918573 && echo 
$HOME/.ansible/tmp/ansible-tmp-1454784058.99-138766034918573'
<10.140.155.162> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o 
ControlPersist=60s -o 
ControlPath="/Users/mspreitz/.ansible/cp/ansible-ssh-%h-%p-%r" -o 
KbdInteractiveAuthentication=no -o 
PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey 
-o PasswordAuthentication=no -o User=mspreitzer -o ConnectTimeout=100 
10.140.155.162 /bin/sh -c 'mkdir -p 
$HOME/.ansible/tmp/ansible-tmp-1454784058.99-66302778668274 && chmod a+rx 
$HOME/.ansible/tmp/ansible-tmp-1454784058.99-66302778668274 && echo 
$HOME/.ansible/tmp/ansible-tmp-1454784058.99-66302778668274'
<10.140.155.161> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o 
ControlPersist=60s -o 
ControlPath="/Users/mspreitz/.ansible/cp/ansible-ssh-%h-%p-%r" -o 
KbdInteractiveAuthentication=no -o 
PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey 
-o PasswordAuthentication=no -o User=mspreitzer -o ConnectTimeout=100 
10.140.155.161 /bin/sh -c 'mkdir -p 
$HOME/.ansible/tmp/ansible-tmp-1454784058.99-221770072514077 && chmod a+rx 
$HOME/.ansible/tmp/ansible-tmp-1454784058.99-221770072514077 && echo 
$HOME/.ansible/tmp/ansible-tmp-1454784058.99-221770072514077'
<10.140.155.166> PUT 
/var/folders/cd/d050bcgj7xd6lkbfhnb9bns80000gp/T/tmpFwQwuI TO setup
<10.140.155.161> PUT 
/var/folders/cd/d050bcgj7xd6lkbfhnb9bns80000gp/T/tmpcY2it_ TO setup
<10.140.155.162> PUT 
/var/folders/cd/d050bcgj7xd6lkbfhnb9bns80000gp/T/tmpuXklyy TO setup
<10.140.155.166> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o 
ControlPersist=60s -o 
ControlPath="/Users/mspreitz/.ansible/cp/ansible-ssh-%h-%p-%r" -o 
KbdInteractiveAuthentication=no -o 
PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey 
-o PasswordAuthentication=no -o User=mspreitzer -o ConnectTimeout=100 
10.140.155.166 /bin/sh -c 'sudo -k && sudo -H -S -p "[sudo via ansible, 
key=spxlnvdjayufeyagwywqhtdtptisrnej] password: " -u root /bin/sh -c 
'"'"'echo BECOME-SUCCESS-spxlnvdjayufeyagwywqhtdtptisrnej; LANG=en_US.UTF-8 
LC_CTYPE=en_US.UTF-8 /usr/bin/python setup'"'"''
<10.140.155.161> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o 
ControlPersist=60s -o 
ControlPath="/Users/mspreitz/.ansible/cp/ansible-ssh-%h-%p-%r" -o 
KbdInteractiveAuthentication=no -o 
PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey 
-o PasswordAuthentication=no -o User=mspreitzer -o ConnectTimeout=100 
10.140.155.161 /bin/sh -c 'sudo -k && sudo -H -S -p "[sudo via ansible, 
key=dprrxybimhltrqziynaikzahzsgdjmlc] password: " -u root /bin/sh -c 
'"'"'echo BECOME-SUCCESS-dprrxybimhltrqziynaikzahzsgdjmlc; LANG=en_US.UTF-8 
LC_CTYPE=en_US.UTF-8 /usr/bin/python setup'"'"''
<10.140.155.162> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o 
ControlPersist=60s -o 
ControlPath="/Users/mspreitz/.ansible/cp/ansible-ssh-%h-%p-%r" -o 
KbdInteractiveAuthentication=no -o 
PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey 
-o PasswordAuthentication=no -o User=mspreitzer -o ConnectTimeout=100 
10.140.155.162 /bin/sh -c 'sudo -k && sudo -H -S -p "[sudo via ansible, 
key=bnojgeocfbeoxbgakzuluppvtxaviuyr] password: " -u root /bin/sh -c 
'"'"'echo BECOME-SUCCESS-bnojgeocfbeoxbgakzuluppvtxaviuyr; LANG=en_US.UTF-8 
LC_CTYPE=en_US.UTF-8 /usr/bin/python setup'"'"''
failed: [10.140.155.166] => {"failed": true, "parsed": false}
BECOME-SUCCESS-spxlnvdjayufeyagwywqhtdtptisrnej
{"verbose_override": true, "changed": false, "ansible_facts": {*<lots of 
gather-facts output here, but it is not complete/>*OpenSSH_6.2p2, OSSLShim 
0.9.8r 8 Dec 2011
debug1: Reading configuration data /Users/mspreitz/.ssh/config
debug1: /Users/mspreitz/.ssh/config line 8: Applying options for 10.*
debug1: Reading configuration data /etc/ssh_config
debug1: /etc/ssh_config line 20: Applying options for *
debug1: /etc/ssh_config line 102: Applying options for *
debug1: auto-mux: Trying existing master
debug2: fd 3 setting O_NONBLOCK
debug2: mux_client_hello_exchange: master version 4
debug3: mux_client_forwards: request forwardings: 0 local, 0 remote
debug3: mux_client_request_session: entering
debug3: mux_client_request_alive: entering
debug3: mux_client_request_alive: done pid = 9654
debug3: mux_client_request_session: session request sent
debug1: mux_client_request_session: master session id: 2
debug3: mux_client_read_packet: read header failed: Broken pipe
debug2: Received exit status from master 0
Shared connection to 10.140.155.166 closed.

>From the same run, here is a failure at a task for the one machine that 
succeeded to gather-facts in this case:

TASK: [bootstrap | add any dependent repository keys from url] 
**************** 
<10.140.155.162> ESTABLISH CONNECTION FOR USER: mspreitzer
<10.140.155.162> REMOTE_MODULE apt_key 
url=http://10.140.132.215:8082/Release.pgp
<10.140.155.162> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o 
ControlPersist=60s -o 
ControlPath="/Users/mspreitz/.ansible/cp/ansible-ssh-%h-%p-%r" -o 
KbdInteractiveAuthentication=no -o 
PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey 
-o PasswordAuthentication=no -o User=mspreitzer -o ConnectTimeout=100 
10.140.155.162 /bin/sh -c 'mkdir -p 
$HOME/.ansible/tmp/ansible-tmp-1454784077.96-58394575390812 && chmod a+rx 
$HOME/.ansible/tmp/ansible-tmp-1454784077.96-58394575390812 && echo 
$HOME/.ansible/tmp/ansible-tmp-1454784077.96-58394575390812'
<10.140.155.162> PUT 
/var/folders/cd/d050bcgj7xd6lkbfhnb9bns80000gp/T/tmpxHKer5 TO apt_key
<10.140.155.162> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o 
ControlPersist=60s -o 
ControlPath="/Users/mspreitz/.ansible/cp/ansible-ssh-%h-%p-%r" -o 
KbdInteractiveAuthentication=no -o 
PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey 
-o PasswordAuthentication=no -o User=mspreitzer -o ConnectTimeout=100 
10.140.155.162 /bin/sh -c 'sudo -k && sudo -H -S -p "[sudo via ansible, 
key=jsmmpfgpouhdlwmcvgmwvbhidafftccn] password: " -u root /bin/sh -c 
'"'"'echo BECOME-SUCCESS-jsmmpfgpouhdlwmcvgmwvbhidafftccn; LANG=en_US.UTF-8 
LC_CTYPE=en_US.UTF-8 /usr/bin/python apt_key'"'"''
failed: [10.140.155.162] => (item={'repo': u'deb 
http://10.140.132.215:8082/ docker main', 'key_url': 
u'http://10.140.132.215:8082/Release.pgp'}) => {"failed": true, "item": 
{"key_url": "http://10.140.132.215:8082/Release.pgp";, "repo": "deb 
http://10.140.132.215:8082/ docker main"}, "parsed": false}
BECOME-SUCCESS-jsmmpfgpouhdlwmcvgmwvbhidafftccn
OpenSSH_6.2p2, OSSLShim 0.9.8r 8 Dec 2011
debug1: Reading configuration data /Users/mspreitz/.ssh/config
debug1: /Users/mspreitz/.ssh/config line 8: Applying options for 10.*
debug1: Reading configuration data /etc/ssh_config
debug1: /etc/ssh_config line 20: Applying options for *
debug1: /etc/ssh_config line 102: Applying options for *
debug1: auto-mux: Trying existing master
debug2: fd 3 setting O_NONBLOCK
debug2: mux_client_hello_exchange: master version 4
debug3: mux_client_forwards: request forwardings: 0 local, 0 remote
debug3: mux_client_request_session: entering
debug3: mux_client_request_alive: entering
debug3: mux_client_request_alive: done pid = 9658
debug3: mux_client_request_session: session request sent
debug1: mux_client_request_session: master session id: 2
debug3: mux_client_read_packet: read header failed: Broken pipe
debug2: Received exit status from master 0
Shared connection to 10.140.155.162 closed.


In desperation I captured all packets on the VPN network interface on my 
Mac during that run, then looked at it with Wireshark.  I see one 
long-lived connection for each of the three managed machines, plus one 
short-lived connection for 10.140.155.162.

I see no end to the long-lived connections.  Just to make sure it was not 
due to a missing buffer flush in the packet capture, I did a `ping` of one 
managed machine after `ansible-playbook` finished; I see the `ping` in the 
captured packets.  So how can Ansible suffer from truncated replies when 
the connections stay open?  How can the connections stay open after 
`ansible-playbook` exits?

The short-lived connection is even stranger.  It comes late in the trace.  
The client closes it soon after the completion of the three-packet 
handshake that opens the connection.  The client sends no TCP payload at 
all, not even the "client hello" of TLS.  The server sends a "server 
hello", which prompts the client to respond with a RST.  What the heck is 
going on with that?

Thanks for any clues,
Mike

-- 
You received this message because you are subscribed to the Google Groups 
"Ansible Project" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to ansible-project+unsubscr...@googlegroups.com.
To post to this group, send email to ansible-project@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/ansible-project/2420a36a-a5ec-4dbc-b65a-5185793c4c0a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Reply via email to