Hi Barry,

One thing I did notice when testing your configuration was that, with my default ulimit settings, large -f (forks) settings were causing similar tracebacks and failures. In my case, setting `ulimit -u 4096` (you may also have to do `ulimit -f 4096`) resolved that issue. I noticed this when using the "ansible" command rather than "ansible-playbook", the latter of which may have been hiding the underlying issue.
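For anyone who wants to see what their current limits are before changing them, here is a minimal sketch using Python's standard `resource` module. The mapping from `ulimit` flags to `RLIMIT_*` constants follows bash's builtin and is my assumption about your shell; the helper names `fmt_limit` and `report_limits` are made up for this example and are not part of Ansible.

```python
import resource

# (ulimit flag, human label, RLIMIT constant name).
# Assumption: flags as in bash's `ulimit` builtin. RLIMIT_NPROC is not
# defined on every platform, so constants are looked up by name below.
LIMITS = [
    ("-u", "max user processes", "RLIMIT_NPROC"),
    ("-n", "open file descriptors", "RLIMIT_NOFILE"),
    ("-f", "max file size in bytes", "RLIMIT_FSIZE"),
]


def fmt_limit(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def report_limits():
    """Return one line per limit with the soft and hard values of this process."""
    lines = []
    for flag, label, const_name in LIMITS:
        const = getattr(resource, const_name, None)
        if const is None:
            lines.append(f"ulimit {flag} ({label}): not available on this platform")
            continue
        soft, hard = resource.getrlimit(const)
        lines.append(
            f"ulimit {flag} ({label}): soft={fmt_limit(soft)} hard={fmt_limit(hard)}"
        )
    return lines


if __name__ == "__main__":
    for line in report_limits():
        print(line)
```

Run it from the same shell you launch ansible from, since limits are inherited per process. A low soft limit on user processes relative to the number of forks would be consistent with what I saw, though I have not confirmed that this is what is happening in your case.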
We are still looking into the host-key checking issue to see if we can replicate that.

Thanks!

On Wed, Sep 24, 2014 at 2:23 PM, Michael DeHaan <[email protected]> wrote:

> Hmm, curious.
>
> Yeah there's not really any extra SSH debug detail in the above.
>
> The error in question occurs in two places - one when the pipe slams shut
> for no good reason, and another when ssh exits with error 255 (aka unknown
> error).
>
> We're still looking into the known hosts awareness question.
>
> On Wed, Sep 24, 2014 at 3:17 PM, Barry Morrison <[email protected]>
> wrote:
>
>> Support Request #2904 has known_hosts file attached to it
>>
>> Hopefully this is the pertinent part from failed facts gathering:
>>
>> <server1020.prod.domain> ESTABLISH CONNECTION FOR USER: bmorriso
>> <server1020.prod.domain> REMOTE_MODULE setup CHECKMODE=True
>> <server1020.prod.domain> EXEC ['ssh', '-C', '-vvv', '-o',
>> 'ControlMaster=auto', '-o', 'ControlPersist=5m', '-o',
>> 'ControlPath=/home/bmorriso/.ansible/cp/ansible-ssh-%h-%p-%r', '-o',
>> 'StrictHostKeyChecking=no', '-o', 'Port=3422', '-o',
>> 'KbdInteractiveAuthentication=no', '-o',
>> 'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey',
>> '-o', 'PasswordAuthentication=no', '-o', 'ConnectTimeout=10',
>> 'server1020.prod.domain', u"/bin/sh -c 'LANG=C LC_CTYPE=C /usr/bin/python'"]
>> ok: [server88.prod.domain]
>> <server3033.prod.domain> ESTABLISH CONNECTION FOR USER: bmorriso
>> <server3033.prod.domain> REMOTE_MODULE setup CHECKMODE=True
>> <server3033.prod.domain> EXEC ['ssh', '-C', '-vvv', '-o',
>> 'ControlMaster=auto', '-o', 'ControlPersist=5m', '-o',
>> 'ControlPath=/home/bmorriso/.ansible/cp/ansible-ssh-%h-%p-%r', '-o',
>> 'StrictHostKeyChecking=no', '-o', 'Port=3422', '-o',
>> 'KbdInteractiveAuthentication=no', '-o',
>> 'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey',
>> '-o', 'PasswordAuthentication=no', '-o', 'ConnectTimeout=10',
>> 'server3033.prod.domain', u"/bin/sh -c
'LANG=C LC_CTYPE=C /usr/bin/python'"]
>> fatal: [server4214.prod.domain] => SSH Error: data could not be sent to
>> the remote host. Make sure this host can be reached over ssh
>> <server1028.prod.domain> ESTABLISH CONNECTION FOR USER: bmorriso
>> <server1028.prod.domain> REMOTE_MODULE setup CHECKMODE=True
>> <server1028.prod.domain> EXEC ['ssh', '-C', '-vvv', '-o',
>> 'ControlMaster=auto', '-o', 'ControlPersist=5m', '-o',
>> 'ControlPath=/home/bmorriso/.ansible/cp/ansible-ssh-%h-%p-%r', '-o',
>> 'StrictHostKeyChecking=no', '-o', 'Port=3422', '-o',
>> 'KbdInteractiveAuthentication=no', '-o',
>> 'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey',
>> '-o', 'PasswordAuthentication=no', '-o', 'ConnectTimeout=10',
>> 'server1028.prod.domain', u"/bin/sh -c 'LANG=C LC_CTYPE=C /usr/bin/python'"]
>> fatal: [server1020.prod.domain] => SSH Error: data could not be sent to
>> the remote host. Make sure this host can be reached over ssh
>>
>> specifically server1020.prod.domain
>>
>> And I was trying to get the tasks to fail as they had many times earlier
>> -- nothing failed. Everything worked as expected and completed in 59s. I'm
>> not convinced it's fixed, but it's behaving. I'll poke at it in the AM. It
>> was too easy to reproduce earlier.
>>
>> On Tuesday, September 23, 2014 7:43:12 PM UTC-7, Michael DeHaan wrote:
>>>
>>> On Tue, Sep 23, 2014 at 10:29 PM, Barry Morrison <[email protected]>
>>> wrote:
>>>
>>>> It is not consistent each attempt as far as which hosts fail. On one
>>>> attempt a server will fail, on the next attempt the same server will not
>>>> fail and if I attempt to gather facts manually after it failed, it is able
>>>> to gather facts successfully each time.
>>>>
>>>> But here is a host that failed on the last attempt:
>>>>
>>>> /usr/bin/ansible server74.prod.domain -m ping -vvvv -c ssh
>>>> <server74.prod.domain> ESTABLISH CONNECTION FOR USER: bmorriso
>>>> <server74.prod.domain> REMOTE_MODULE ping
>>>> <server74.prod.domain> EXEC ['ssh', '-C', '-vvv', '-o',
>>>> 'ControlMaster=auto', '-o', 'ControlPersist=5m', '-o',
>>>> 'ControlPath=/home/bmorriso/.ansible/cp/ansible-ssh-%h-%p-%r', '-o',
>>>> 'StrictHostKeyChecking=no', '-o', 'Port=3422', '-o',
>>>> 'KbdInteractiveAuthentication=no', '-o',
>>>> 'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey',
>>>> '-o', 'PasswordAuthentication=no', '-o', 'ConnectTimeout=10',
>>>> 'server74.prod.domain', u"/bin/sh -c 'LANG=C LC_CTYPE=C /usr/bin/python'"]
>>>> server74.prod.domain | success >> {
>>>> "changed": false,
>>>> "ping": "pong"
>>>> }
>>>>
>>> Ok so this one is successful and SSH debug levels are not helping. I'm
>>> going to need to see one that fails, unfortunately. That may need a
>>> capture from the long run....
>>>
>>>> --forks was set to 50 and I saw ~35 hosts fail
>>>> --forks set to 25, only 5 failed and it ran in 47s
>>>> --forks set to 15, none failed and it ran in 53s
>>>> --forks set to 20, none failed and it ran in 45s
>>>>
>>>> The above are all with host checking off.
>>>>
>>>> Here is another "twist". With --forks passed, if the fact gathering
>>>> doesn't fail, a task will, and like fact gathering before, it's never the
>>>> same task that fails.
>>>>
>>> This feels to me like there may be some problem keeping ControlPersist
>>> sockets open.
>>>
>>> One thing to note is they do typically consume about ~1MB per host,
>>> though at -f 50 this shouldn't be a problem.
>>>
>>> Also that version of Ubuntu should be perfectly fine.
>>>
>>> I've occasionally heard of issues with network hardware in the way - a
>>> particularly badly misconfigured switch clamping things down.
>>>
>>> In this particular case, once discovered, the user was soon managing
>>> thousands of nodes at a time.
>>>
>>> Though it's hard to say. More digging is definitely required.
>>>
>>>> Task fails with: "ssh connection closed waiting for sudo or su password
>>>> prompt"
>>>>
>>>> With host checking on
>>>>
>>>> --forks 25 = 10m
>>>> --forks 50 = 10m
>>>>
>>>> FWIW, If I set forks: 50 in /etc/ansible/ansible.cfg -- it still acts
>>>> as if it is set to 5, only when I pass --forks 50 in the command does it
>>>> actually seem to run at 50.
>>>>
>>> This is curious. Possibly a permissions issue on ansible.cfg keeping
>>> it from being read, or the value being outside the right section.
>>> [defaults] vs [default] or something is possible.
>>>
>>> If you can email us the file, I'd be interested in seeing it.
>>>
>>> Again, also interested in your known_hosts to try to see if we can tell
>>> why it might not be detecting that your host is in the file.
>>>
>>> That SSH is asking shows it's there, but for some reason Ansible is
>>> thinking it may need to ask you.
>>>
>>> Again, about 65-75% of our users are using these default options vs
>>> paramiko - and haven't heard this reported recently - so hope to get to
>>> the bottom of this.
>>>
>>> Help with the above questions and info would be greatly appreciated!
>>>
>>>> Also, no "old" version of Ansible
>>>>
>>>> which ansible
>>>> /usr/bin/ansible
>>>>
>>>> /usr/bin/ansible --version
>>>> ansible 1.7.1
>>>>
>>>> Hope this helps, but fear it may add to the confusion.
>>>>
>>>> On Tuesday, September 23, 2014 6:39:47 PM UTC-7, Michael DeHaan wrote:
>>>>>
>>>>> "With it commented, no failures, I'm able to communicate with all
>>>>> servers. "
>>>>>
>>>>> This part is a little interesting.
>>>>>
>>>>> Turning off host checking and going slow you can talk to all your
>>>>> hosts. Going fast you cannot?
>>>>>
>>>>> (If this is repeatable, I wonder if maybe you have an SSH jumphost
>>>>> configured that might be getting overwhelmed? Or perhaps something
>>>>> similar on the network?)
>>>>>
>>>>> Can I ask what --forks is set to?
>>>>>
>>>>> On Tue, Sep 23, 2014 at 9:36 PM, Michael DeHaan <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Ok Barry,
>>>>>>
>>>>>> We'll get you sorted before you wander off and lose a limb :)
>>>>>>
>>>>>> These things seem to be unrelated.
>>>>>>
>>>>>> (A)
>>>>>>
>>>>>> This has happened in the past when the host key of a host doesn't
>>>>>> *appear* to Ansible's ssh.py connection type to be in the known hosts
>>>>>> file, and it creates a serial lock to ask you the question about whether
>>>>>> it should be added - but for whatever reason, knew it was actually there.
>>>>>> The result of this is that --forks is not used on the first task per
>>>>>> host, which makes things not be parallel. It's frustrating.
>>>>>>
>>>>>> This was fixed long ago, when we added knowledge about hashed
>>>>>> known_hosts entries, and should be quite good today, especially on a well
>>>>>> tested OS like 14.04, basically at the top of our test matrix. Finding it
>>>>>> again now is curious.
>>>>>>
>>>>>> I'd worry if something else might be interfering with the lock. My
>>>>>> first question is if (maybe privately), we could see your known_hosts
>>>>>> file?
>>>>>>
>>>>>> So we're not quite out of that territory yet with host key checking
>>>>>> on, but I'm still curious about why it may still be doing that.
>>>>>>
>>>>>> There may be a slim chance you're actually using an older ansible
>>>>>> version, or they are hashed weirdly for some reason.
>>>>>>
>>>>>> I'll assume this is happening with "-c ssh".
>>>>>>
>>>>>> (I'd also be curious if this happens on the development branch, but I
>>>>>> don't anticipate any changes there)
>>>>>>
>>>>>> (B)
>>>>>>
>>>>>> On the second question, I'm expecting these 10 hosts are consistently
>>>>>> doing that between runs, as in the same hosts?
>>>>>>
>>>>>> Can I get the result of an /usr/bin/ansible hostname -m ping -vvvv -c ssh
>>>>>> against one of them?
>>>>>>
>>>>>> That will engage SSH debug mode and tell us a little more about what
>>>>>> may be up.
>>>>>>
>>>>>> They could actually be down, but I'm guessing you checked that.
>>>>>> That being returned extraneously is not expected.
>>>>>>
>>>>>> It could also be that ansible_ssh_port or something needs to be set
>>>>>> in inventory or whatever, and it's not normally set, firewall issues, or
>>>>>> things like that?
>>>>>>
>>>>>> Let's start with the "-vvvv" part.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> On Tue, Sep 23, 2014 at 9:20 PM, Barry Morrison <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Oh, FWIW, I'm touching over 350 servers with this playbook and
>>>>>>> gathering facts from all of them.
>>>>>>>
>>>>>>> On Tuesday, September 23, 2014 6:17:53 PM UTC-7, Barry Morrison
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Spawned from Conversation with Michael on Twitter
>>>>>>>> https://twitter.com/esacteksab/status/514558427217936384
>>>>>>>>
>>>>>>>> Uncommenting host_key_checking = False, a playbook runs in 35s
>>>>>>>> Commenting host_key_checking = False, the playbook runs in 9m25s
>>>>>>>>
>>>>>>>> But with it uncommented, ~10% of the servers return: "SSH Error:
>>>>>>>> data could not be sent to the remote host. Make sure this host can be
>>>>>>>> reached over ssh"
>>>>>>>>
>>>>>>>> With it commented, no failures, I'm able to communicate with all
>>>>>>>> servers.
>>>>>>>>
>>>>>>>> This is a topic to troubleshoot further, because Twitter and
>>>>>>>> 140 chars isn't all that great.
>>>>>>>>
>>>>>>>> Ansible is 1.7.1 on Ubuntu 14.04
>>>>>>>> Servers are a combination of Ubuntu 12.04 and 14.04
>>>>>>>>
>>>>>>> --
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "Ansible Project" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>> send an email to [email protected].
>>>>>>> To post to this group, send email to [email protected].
>>>>>>> To view this discussion on the web visit
>>>>>>> https://groups.google.com/d/msgid/ansible-project/78fd3ef2-1b80-4167-b2f6-99d49569a177%40googlegroups.com
>>>>>>>
>>>>>>> For more options, visit https://groups.google.com/d/optout.
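On the `forks: 50` in /etc/ansible/ansible.cfg observation from earlier in the thread: Ansible reads `forks` from the `[defaults]` section, so a quick way to rule out the "[defaults] vs [default]" possibility is to list which section the option actually sits in. Below is a rough sketch using Python's standard `configparser`. The function names `find_option` and `forks_in_defaults` are hypothetical helpers for this example, and the parsing is my own approximation, not how Ansible itself loads its configuration.

```python
import configparser


def find_option(cfg_text, option="forks"):
    """Return (section, value) pairs for every section that sets `option`."""
    parser = configparser.ConfigParser(
        interpolation=None,                  # ansible.cfg values may contain '%'
        default_section="__no_default__",    # treat every header as a plain section
    )
    parser.read_string(cfg_text)
    return [
        (section, parser.get(section, option))
        for section in parser.sections()
        if parser.has_option(section, option)
    ]


def forks_in_defaults(cfg_text):
    """True when `forks` is set under the [defaults] section."""
    return any(section == "defaults" for section, _ in find_option(cfg_text))


if __name__ == "__main__":
    # Path taken from the thread; adjust if your config lives elsewhere.
    with open("/etc/ansible/ansible.cfg") as handle:
        text = handle.read()
    print(find_option(text))
    print("forks under [defaults]:", forks_in_defaults(text))
```

If this prints a section other than `defaults`, that would explain the value being ignored. It does not cover the other possibilities, such as file permissions or another ansible.cfg (for example one in the current directory or `~/.ansible.cfg`) being picked up first.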
