I am getting some really hard to understand failures. I have three managed machines, running Ubuntu 14.04.03 --- which normally has Linux kernel 3.13. I run Ansible 1.9.4 on my Mac (OS 10.10.5) to manage them. The managed machines are VMs in a cloud service, on a datacenter subnet, and I use an OpenVPN tunnel between them and my Mac. I have a playbook, and it runs. All is well. BTW: I have no ansible.cfg anywhere (I set the SL_INVENTORY envar to point at the appropriate dynamic inventory program), and none of my SSH config files say anything about TCPKeepAlive or ServerAlive<whatever>.
Then on the managed machines I run (as root) `apt-get install linux-generic-lts-wily`, which upgrades the kernel to 4.2. After that my Ansible is flaky. When I run my playbook, usually one or two of the three managed machines fails the gather-facts; the remainder fail in the first few tasks. Which fails where is variable, so I think this is due to some intermittent problem. Thinking it might be a connectivity flakiness, I created `ansible.cfg` that specifies a timeout of 100, and set `TCPKeepAlive no` and `ServerAliveCountMax 100` in my SSH client config (leaving ServerAliveInterval unspecified everywhere, which I think disables that function). These did not solve the problem. The playbook runs to full failure in under 30 seconds. The failure message shows a truncated reply from the remote machine and then some SSH debugging that ends with a complaint about a broken pipe. Here is an example from the gather-facts step (I passed `-vvvv` to `ansible-playbook`): GATHERING FACTS *************************************************************** <10.140.155.166> ESTABLISH CONNECTION FOR USER: mspreitzer <10.140.155.162> ESTABLISH CONNECTION FOR USER: mspreitzer <10.140.155.166> REMOTE_MODULE setup <10.140.155.161> ESTABLISH CONNECTION FOR USER: mspreitzer <10.140.155.162> REMOTE_MODULE setup <10.140.155.161> REMOTE_MODULE setup <10.140.155.166> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/Users/mspreitz/.ansible/cp/ansible-ssh-%h-%p-%r" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=mspreitzer -o ConnectTimeout=100 10.140.155.166 /bin/sh -c 'mkdir -p $HOME/.ansible/tmp/ansible-tmp-1454784058.99-138766034918573 && chmod a+rx $HOME/.ansible/tmp/ansible-tmp-1454784058.99-138766034918573 && echo $HOME/.ansible/tmp/ansible-tmp-1454784058.99-138766034918573' <10.140.155.162> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/Users/mspreitz/.ansible/cp/ansible-ssh-%h-%p-%r" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=mspreitzer -o ConnectTimeout=100 10.140.155.162 /bin/sh -c 'mkdir -p $HOME/.ansible/tmp/ansible-tmp-1454784058.99-66302778668274 && chmod a+rx $HOME/.ansible/tmp/ansible-tmp-1454784058.99-66302778668274 && echo $HOME/.ansible/tmp/ansible-tmp-1454784058.99-66302778668274' <10.140.155.161> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/Users/mspreitz/.ansible/cp/ansible-ssh-%h-%p-%r" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=mspreitzer -o ConnectTimeout=100 10.140.155.161 /bin/sh -c 'mkdir -p $HOME/.ansible/tmp/ansible-tmp-1454784058.99-221770072514077 && chmod a+rx $HOME/.ansible/tmp/ansible-tmp-1454784058.99-221770072514077 && echo $HOME/.ansible/tmp/ansible-tmp-1454784058.99-221770072514077' <10.140.155.166> PUT /var/folders/cd/d050bcgj7xd6lkbfhnb9bns80000gp/T/tmpFwQwuI TO setup <10.140.155.161> PUT /var/folders/cd/d050bcgj7xd6lkbfhnb9bns80000gp/T/tmpcY2it_ TO setup <10.140.155.162> PUT /var/folders/cd/d050bcgj7xd6lkbfhnb9bns80000gp/T/tmpuXklyy TO setup <10.140.155.166> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/Users/mspreitz/.ansible/cp/ansible-ssh-%h-%p-%r" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=mspreitzer -o ConnectTimeout=100 10.140.155.166 /bin/sh -c 'sudo -k && sudo -H -S -p "[sudo via ansible, key=spxlnvdjayufeyagwywqhtdtptisrnej] password: " -u root /bin/sh -c '"'"'echo BECOME-SUCCESS-spxlnvdjayufeyagwywqhtdtptisrnej; LANG=en_US.UTF-8 LC_CTYPE=en_US.UTF-8 /usr/bin/python setup'"'"'' <10.140.155.161> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/Users/mspreitz/.ansible/cp/ansible-ssh-%h-%p-%r" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=mspreitzer -o ConnectTimeout=100 10.140.155.161 /bin/sh -c 'sudo -k && sudo -H -S -p "[sudo via ansible, key=dprrxybimhltrqziynaikzahzsgdjmlc] password: " -u root /bin/sh -c '"'"'echo BECOME-SUCCESS-dprrxybimhltrqziynaikzahzsgdjmlc; LANG=en_US.UTF-8 LC_CTYPE=en_US.UTF-8 /usr/bin/python setup'"'"'' <10.140.155.162> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/Users/mspreitz/.ansible/cp/ansible-ssh-%h-%p-%r" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=mspreitzer -o ConnectTimeout=100 10.140.155.162 /bin/sh -c 'sudo -k && sudo -H -S -p "[sudo via ansible, key=bnojgeocfbeoxbgakzuluppvtxaviuyr] password: " -u root /bin/sh -c '"'"'echo BECOME-SUCCESS-bnojgeocfbeoxbgakzuluppvtxaviuyr; LANG=en_US.UTF-8 LC_CTYPE=en_US.UTF-8 /usr/bin/python setup'"'"'' failed: [10.140.155.166] => {"failed": true, "parsed": false} BECOME-SUCCESS-spxlnvdjayufeyagwywqhtdtptisrnej {"verbose_override": true, "changed": false, "ansible_facts": {*<lots of gather-facts output here, but it is not complete/>*OpenSSH_6.2p2, OSSLShim 0.9.8r 8 Dec 2011 debug1: Reading configuration data /Users/mspreitz/.ssh/config debug1: /Users/mspreitz/.ssh/config line 8: Applying options for 10.* debug1: Reading configuration data /etc/ssh_config debug1: /etc/ssh_config line 20: Applying options for * debug1: /etc/ssh_config line 102: Applying options for * debug1: auto-mux: Trying existing master debug2: fd 3 setting O_NONBLOCK debug2: mux_client_hello_exchange: master version 4 debug3: mux_client_forwards: request forwardings: 0 local, 0 remote debug3: mux_client_request_session: entering debug3: mux_client_request_alive: entering debug3: mux_client_request_alive: done pid = 9654 debug3: mux_client_request_session: session request sent debug1: mux_client_request_session: master session id: 2 debug3: mux_client_read_packet: read header failed: Broken pipe debug2: Received exit status from master 0 Shared connection to 10.140.155.166 closed. >From the same run, here is a failure at a task for the one machine that succeeded to gather-facts in this case: TASK: [bootstrap | add any dependent repository keys from url] **************** <10.140.155.162> ESTABLISH CONNECTION FOR USER: mspreitzer <10.140.155.162> REMOTE_MODULE apt_key url=http://10.140.132.215:8082/Release.pgp <10.140.155.162> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/Users/mspreitz/.ansible/cp/ansible-ssh-%h-%p-%r" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=mspreitzer -o ConnectTimeout=100 10.140.155.162 /bin/sh -c 'mkdir -p $HOME/.ansible/tmp/ansible-tmp-1454784077.96-58394575390812 && chmod a+rx $HOME/.ansible/tmp/ansible-tmp-1454784077.96-58394575390812 && echo $HOME/.ansible/tmp/ansible-tmp-1454784077.96-58394575390812' <10.140.155.162> PUT /var/folders/cd/d050bcgj7xd6lkbfhnb9bns80000gp/T/tmpxHKer5 TO apt_key <10.140.155.162> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/Users/mspreitz/.ansible/cp/ansible-ssh-%h-%p-%r" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=mspreitzer -o ConnectTimeout=100 10.140.155.162 /bin/sh -c 'sudo -k && sudo -H -S -p "[sudo via ansible, key=jsmmpfgpouhdlwmcvgmwvbhidafftccn] password: " -u root /bin/sh -c '"'"'echo BECOME-SUCCESS-jsmmpfgpouhdlwmcvgmwvbhidafftccn; LANG=en_US.UTF-8 LC_CTYPE=en_US.UTF-8 /usr/bin/python apt_key'"'"'' failed: [10.140.155.162] => (item={'repo': u'deb http://10.140.132.215:8082/ docker main', 'key_url': u'http://10.140.132.215:8082/Release.pgp'}) => {"failed": true, "item": {"key_url": "http://10.140.132.215:8082/Release.pgp", "repo": "deb http://10.140.132.215:8082/ docker main"}, "parsed": false} BECOME-SUCCESS-jsmmpfgpouhdlwmcvgmwvbhidafftccn OpenSSH_6.2p2, OSSLShim 0.9.8r 8 Dec 2011 debug1: Reading configuration data /Users/mspreitz/.ssh/config debug1: /Users/mspreitz/.ssh/config line 8: Applying options for 10.* debug1: Reading configuration data /etc/ssh_config debug1: /etc/ssh_config line 20: Applying options for * debug1: /etc/ssh_config line 102: Applying options for * debug1: auto-mux: Trying existing master debug2: fd 3 setting O_NONBLOCK debug2: mux_client_hello_exchange: master version 4 debug3: mux_client_forwards: request forwardings: 0 local, 0 remote debug3: mux_client_request_session: entering debug3: mux_client_request_alive: entering debug3: mux_client_request_alive: done pid = 9658 debug3: mux_client_request_session: session request sent debug1: mux_client_request_session: master session id: 2 debug3: mux_client_read_packet: read header failed: Broken pipe debug2: Received exit status from master 0 Shared connection to 10.140.155.162 closed. In desperation I captured all packets on the VPN network interface on my Mac during that run, then looked at it with Wireshark. I see one long-lived connection for each of the three managed machines, plus one short-lived connection for 10.140.155.162. I see no end to the long-lived connections. Just to make sure it was not due to a missing buffer flush in the packet capture, I did a `ping` of one managed machine after `ansible-playbook` finished; I see the `ping` in the captured packets. So how can Ansible suffer from truncated replies when the connections stay open? How can the connections stay open after `ansible-playbook` exits? The short-lived connection is even stranger. It comes late in the trace. The client closes it soon after the completion of the three-packet handshake that opens the connection. The client sends no TCP payload at all, not even the "client hello" of TLS. The server sends a "server hello", which prompts the client to respond with a RST. What the heck is going on with that? Thanks for any clues, Mike -- You received this message because you are subscribed to the Google Groups "Ansible Project" group. To unsubscribe from this group and stop receiving emails from it, send an email to ansible-project+unsubscr...@googlegroups.com. To post to this group, send email to ansible-project@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/ansible-project/2420a36a-a5ec-4dbc-b65a-5185793c4c0a%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.