I also meant to ask if you could do a simple `ansible -m ping all` test (before and after changing the ulimit settings), to see if you still see slowness with that simple test or if it is directly related to fact-gathering.
Thanks! On Mon, Sep 29, 2014 at 10:28 AM, James Cammarata <[email protected]> wrote: > Hi Barry, > > One thing I did notice when testing your configuration was that, with my > default ulimit settings, large -f settings were causing similar tracebacks > and failures. In my case setting `ulimit -u 4096` (may also have to do > `ulimit -f 4096`) resolved that issue. I noticed this when using the > "ansible" command vs. "ansible-playbook", the later of which may have been > hidding the underlying issue. > > We are still looking into the host-key checking issue to see if we can > replicate that. > > Thanks! > > > On Wed, Sep 24, 2014 at 2:23 PM, Michael DeHaan <[email protected]> > wrote: > >> Hmm, curious. >> >> Yeah there's not really any extra SSH debug detail in the above. >> >> The error in question occurs in two places - one when the pipe slams shut >> for no good reason, and another when ssh exists with error 255 (aka unknown >> error). >> >> We're still looking into the known hosts awareness question. >> >> >> >> >> >> On Wed, Sep 24, 2014 at 3:17 PM, Barry Morrison <[email protected]> >> wrote: >> >>> Support Request #2904 has known_hosts file attached to it >>> >>> Hopefully this is the pertinent part from failed facts gathering: >>> >>> <server1020.prod.domain> ESTABLISH CONNECTION FOR USER: bmorriso >>> <server1020.prod.domain> REMOTE_MODULE setup CHECKMODE=True >>> <server1020.prod.domain> EXEC ['ssh', '-C', '-vvv', '-o', >>> 'ControlMaster=auto', '-o', 'ControlPersist=5m', '-o', >>> 'ControlPath=/home/bmorriso/.ansible/cp/ansible-ssh-%h-%p-%r', '-o', >>> 'StrictHostKeyChecking=no', '-o', 'Port=3422', '-o', >>> 'KbdInteractiveAuthentication=no', '-o', >>> 'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey', >>> '-o', 'PasswordAuthentication=no', '-o', 'ConnectTimeout=10', >>> 'server1020.prod.domain', u"/bin/sh -c 'LANG=C LC_CTYPE=C /usr/bin/python'"] >>> ok: [server88.prod.domain] >>> <server3033.prod.domain> ESTABLISH CONNECTION FOR USER: bmorriso >>> <server3033.prod.domain> REMOTE_MODULE setup CHECKMODE=True >>> <server3033.prod.domain> EXEC ['ssh', '-C', '-vvv', '-o', >>> 'ControlMaster=auto', '-o', 'ControlPersist=5m', '-o', >>> 'ControlPath=/home/bmorriso/.ansible/cp/ansible-ssh-%h-%p-%r', '-o', >>> 'StrictHostKeyChecking=no', '-o', 'Port=3422', '-o', >>> 'KbdInteractiveAuthentication=no', '-o', >>> 'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey', >>> '-o', 'PasswordAuthentication=no', '-o', 'ConnectTimeout=10', >>> 'server3033.prod.domain', u"/bin/sh -c 'LANG=C LC_CTYPE=C /usr/bin/python'"] >>> fatal: [server4214.prod.domain] => SSH Error: data could not be sent to >>> the remote host. Make sure this host can be reached over ssh >>> <server1028.prod.domain> ESTABLISH CONNECTION FOR USER: bmorriso >>> <server1028.prod.domain> REMOTE_MODULE setup CHECKMODE=True >>> <server1028.prod.domain> EXEC ['ssh', '-C', '-vvv', '-o', >>> 'ControlMaster=auto', '-o', 'ControlPersist=5m', '-o', >>> 'ControlPath=/home/bmorriso/.ansible/cp/ansible-ssh-%h-%p-%r', '-o', >>> 'StrictHostKeyChecking=no', '-o', 'Port=3422', '-o', >>> 'KbdInteractiveAuthentication=no', '-o', >>> 'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey', >>> '-o', 'PasswordAuthentication=no', '-o', 'ConnectTimeout=10', >>> 'server1028.prod.domain', u"/bin/sh -c 'LANG=C LC_CTYPE=C /usr/bin/python'"] >>> fatal: [server1020.prod.domain] => SSH Error: data could not be sent to >>> the remote host. Make sure this host can be reached over ssh >>> >>> >>> specifically server1020.prod.domain >>> >>> And I was trying to get the tasks to fail as they had many times earlier >>> -- nothing failed. Everything worked as expected and completed in 59s. I'm >>> not convinced its fixed, but it's behaving. I'll poke at it in the AM. It >>> was too easy to reproduce earlier. >>> >>> On Tuesday, September 23, 2014 7:43:12 PM UTC-7, Michael DeHaan wrote: >>>> >>>> >>>> >>>> On Tue, Sep 23, 2014 at 10:29 PM, Barry Morrison <[email protected]> >>>> wrote: >>>> >>>>> It is not consistent each attempt as far as which hosts fail. On one >>>>> attempt a server will fail, on the next attempt the same server will not >>>>> fail and if I attempt to gather facts manually after it failed, it is able >>>>> to gather facts successfully each time. >>>>> >>>>> But here is a host that failed on the last attempt: >>>>> >>>>> /usr/bin/ansible server74.prod.domain -m ping -vvvv -c ssh >>>>> <server74.prod.domain> ESTABLISH CONNECTION FOR USER: bmorriso >>>>> <server74.prod.domain> REMOTE_MODULE ping >>>>> <server74.prod.domain> EXEC ['ssh', '-C', '-vvv', '-o', >>>>> 'ControlMaster=auto', '-o', 'ControlPersist=5m', '-o', >>>>> 'ControlPath=/home/bmorriso/.ansible/cp/ansible-ssh-%h-%p-%r', '-o', >>>>> 'StrictHostKeyChecking=no', '-o', 'Port=3422', '-o', >>>>> 'KbdInteractiveAuthentication=no', '-o', 'PreferredAuthentications= >>>>> gssapi-with-mic,gssapi-keyex,hostbased,publickey', '-o', >>>>> 'PasswordAuthentication=no', '-o', 'ConnectTimeout=10', >>>>> 'server74.prod.domain', u"/bin/sh -c 'LANG=C LC_CTYPE=C /usr/bin/python'"] >>>>> server74.prod.domain | success >> { >>>>> "changed": false, >>>>> "ping": "pong" >>>>> } >>>>> >>>> >>>> Ok so this one is successful and SSH debug levels are not helping. I'm >>>> going to need to see one that fails, unfortunately. That may need a >>>> capture from the long run.... >>>> >>>> >>>>> >>>>> >>>>> --forks was set to 50 and I saw ~35 hosts fail >>>>> --forks set to 25, only 5 failed and it ran in 47s >>>>> --forks set to 15, none failed and it ran in 53s >>>>> --forks set to 20, none failed and it ran in 45s >>>>> >>>>> The above are all with host checking off. >>>>> >>>>> Here is another "twist". With --forks passed, if the fact gathering >>>>> doesn't fail, a task will, and like fact gathering before, it's never the >>>>> same task that fails. >>>>> >>>> >>>> >>>> This feels to me like there may be some problem keeping ControlPersist >>>> sockets open. >>>> >>>> One thing to note is they do typically consume about ~1MB per host, >>>> though at -f 50 this shouldn't be a problem. >>>> >>>> Also that version of Ubuntu should be perfectly fine. >>>> >>>> I've occasionally heard of issues with network hardware in the way - a >>>> particularly badly misconfigured switch clamping things down. >>>> >>>> In this particular case, once discovered, the user was soon managing >>>> thousands of nodes at a time. >>>> >>>> Though it's hard to say. More digging is definitely required. >>>> >>>> >>>> >>>> >>>>> >>>>> Task fails with: "ssh connection closed waiting for sudo or su >>>>> password prompt" >>>>> >>>>> >>>>> With host checking on >>>>> >>>>> --forks 25 = 10m >>>>> --forks 50 = 10m >>>>> >>>>> FWIW, If I set forks: 50 in /etc/ansible/ansible.cfg -- it still acts >>>>> as if it is set to 5, only when I pass --forks 50 in the command does it >>>>> actually seem to run at 50. >>>>> >>>> >>>> >>>> This is curious. Possibly a permissions issue on ansible.cfg keeping >>>> it from being read, or the value out of the right section. >>>> [defaults] vs [default] or something is possible. >>>> >>>> If you can email us the file, I'd be interested in seeing it. >>>> >>>> Again, also interested in your known_hosts to try to see if we can tell >>>> why it might not be detecting that your host is in the file. >>>> >>>> That SSH is asking shows it's there, but for some reason Ansible is >>>> thinking it may need to ask you. >>>> >>>> Again, about 65-75% of our users are using these default options vs >>>> paramiko - and haven't heard this reported recently - so hope to get to >>>> the bottom of this. >>>> >>>> Help with the above questions and info would be greatly appreciated! >>>> >>>> >>>>> >>>>> Also, no "old" version of Ansible >>>>> >>>>> which ansible >>>>> /usr/bin/ansible >>>>> >>>>> /usr/bin/ansible --version >>>>> ansible 1.7.1 >>>>> >>>>> Hope this helps, but fear it may add to the confusion. >>>>> >>>>> On Tuesday, September 23, 2014 6:39:47 PM UTC-7, Michael DeHaan wrote: >>>>>> >>>>>> "With it commented, no failures, I'm able to communicate with all >>>>>> servers. " >>>>>> >>>>>> This part is a little interesting. >>>>>> >>>>>> Turning off host checking and going slow you can talk to all your >>>>>> hosts. Going fast you cannot? >>>>>> >>>>>> (If this is repeatable, I wonder if maybe you have an SSH jumphost >>>>>> configured that might be getting overwhelmed? Or perhaps something >>>>>> similar on the network?) >>>>>> >>>>>> Can I ask what --forks is set to? >>>>>> >>>>>> >>>>>> >>>>>> On Tue, Sep 23, 2014 at 9:36 PM, Michael DeHaan <[email protected]> >>>>>> wrote: >>>>>> >>>>>>> Ok Barry, >>>>>>> >>>>>>> We'll get you sorted before you wander off and lose a limb :) >>>>>>> >>>>>>> These things seem to be unrelated. >>>>>>> >>>>>>> (A) >>>>>>> >>>>>>> This has happened in the past when the host key of a host doesn't >>>>>>> *appear* to Ansible's ssh.py connection type to be in the known hosts >>>>>>> file, >>>>>>> and it creates a serial lock to ask you the question about whether it >>>>>>> should be added - but for whatever reason, knew it was actually there. >>>>>>> The result of this is that --forks is not used on the first task per >>>>>>> host, >>>>>>> which makes things not be parallel. It's frustrating. >>>>>>> >>>>>>> This was fixed long ago, when we added knowledge about hashed >>>>>>> known_hosts entries, and should be quite good today, especially on a >>>>>>> well >>>>>>> tested OS like 14.04, basically at the top of our test matrix. >>>>>>> Finding it >>>>>>> again now is curious. >>>>>>> >>>>>>> I'd worry if something else might be interferring with the lock. >>>>>>> My first question is if (maybe privately), we could see your known_hosts >>>>>>> file? >>>>>>> >>>>>>> So we're not quite out of that territory yet with host key checking >>>>>>> on, but I'm still curious about why it may still be doing that. >>>>>>> >>>>>>> There may be a slim chance you're actually using an older ansible >>>>>>> version, or they are hashed weirdly for some reason. >>>>>>> >>>>>>> I'll assume this is happening with "-c ssh". >>>>>>> >>>>>>> (I'd also be curious if this happens on the development branch, but >>>>>>> I don't anticipate any changes there) >>>>>>> >>>>>>> (B) >>>>>>> >>>>>>> On the second question, I'm expecting these 10 hosts are >>>>>>> consistently doing that between runs, as in the same hosts? >>>>>>> >>>>>>> Can I get the result of an /usr/bin/ansible hostname -m ping -vvvv >>>>>>> -c ssh against one of them? >>>>>>> >>>>>>> That will engage SSH debug mode and tell us a little more about what >>>>>>> may be up. >>>>>>> >>>>>>> They could actually be down, but I'm guessing you checked that. >>>>>>> That being returned extraneously is not expected. >>>>>>> >>>>>>> It could also be that ansible_ssh_port or something needs to be set >>>>>>> in inventory or whatever, and it's not normally set, firewall issues, or >>>>>>> things like that? >>>>>>> >>>>>>> Let's start with the "-vvvv" part. >>>>>>> >>>>>>> Thanks! >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Tue, Sep 23, 2014 at 9:20 PM, Barry Morrison <[email protected]> >>>>>>> wrote: >>>>>>> >>>>>>>> Oh, FWIW, I'm touching over 350 servers with this playbook and >>>>>>>> gathering facts from all of them. >>>>>>>> >>>>>>>> On Tuesday, September 23, 2014 6:17:53 PM UTC-7, Barry Morrison >>>>>>>> wrote: >>>>>>>>> >>>>>>>>> Spawned from Conversation with Michael on Twitter >>>>>>>>> https://twitter.com/esacteksab/status/514558427217936384 >>>>>>>>> >>>>>>>>> Uncommenting host_key_checking = False, a playbook runs in 35s >>>>>>>>> Commenting host_key_checking = False, the playbook runs in 9m25s >>>>>>>>> >>>>>>>>> But with it uncommented, ~10% of the servers return: "SSH Error: >>>>>>>>> data could not be sent to the remote host. Make sure this host can be >>>>>>>>> reached over ssh" >>>>>>>>> >>>>>>>>> With it commented, no failures, I'm able to communicate with all >>>>>>>>> servers. >>>>>>>>> >>>>>>>>> This is a topic for to troubleshoot further, because Twitter and >>>>>>>>> 140 chars isn't all that great. >>>>>>>>> >>>>>>>>> Ansible is 1.7.1 on Ubuntu 14.04 >>>>>>>>> Servers are a combination of Ubuntu 12.04 and 14.04 >>>>>>>>> >>>>>>>> -- >>>>>>>> You received this message because you are subscribed to the Google >>>>>>>> Groups "Ansible Project" group. >>>>>>>> To unsubscribe from this group and stop receiving emails from it, >>>>>>>> send an email to [email protected]. >>>>>>>> To post to this group, send email to [email protected]. >>>>>>>> To view this discussion on the web visit >>>>>>>> https://groups.google.com/d/msgid/ansible-project/78fd3ef2- >>>>>>>> 1b80-4167-b2f6-99d49569a177%40googlegroups.com >>>>>>>> <https://groups.google.com/d/msgid/ansible-project/78fd3ef2-1b80-4167-b2f6-99d49569a177%40googlegroups.com?utm_medium=email&utm_source=footer> >>>>>>>> . >>>>>>>> >>>>>>>> For more options, visit https://groups.google.com/d/optout. >>>>>>>> >>>>>>> >>>>>>> >>>>>> -- >>>>> You received this message because you are subscribed to the Google >>>>> Groups "Ansible Project" group. >>>>> To unsubscribe from this group and stop receiving emails from it, send >>>>> an email to [email protected]. >>>>> To post to this group, send email to [email protected]. >>>>> To view this discussion on the web visit https://groups.google.com/d/ >>>>> msgid/ansible-project/819205f6-e110-47d2-a43c- >>>>> 1b93897322f6%40googlegroups.com >>>>> <https://groups.google.com/d/msgid/ansible-project/819205f6-e110-47d2-a43c-1b93897322f6%40googlegroups.com?utm_medium=email&utm_source=footer> >>>>> . >>>>> >>>>> For more options, visit https://groups.google.com/d/optout. >>>>> >>>> >>>> -- >>> You received this message because you are subscribed to the Google >>> Groups "Ansible Project" group. >>> To unsubscribe from this group and stop receiving emails from it, send >>> an email to [email protected]. >>> To post to this group, send email to [email protected]. >>> To view this discussion on the web visit >>> https://groups.google.com/d/msgid/ansible-project/327689b5-e7d1-4246-84a6-7f395b31fd1f%40googlegroups.com >>> <https://groups.google.com/d/msgid/ansible-project/327689b5-e7d1-4246-84a6-7f395b31fd1f%40googlegroups.com?utm_medium=email&utm_source=footer> >>> . >>> >>> For more options, visit https://groups.google.com/d/optout. >>> >> >> -- >> You received this message because you are subscribed to the Google Groups >> "Ansible Project" group. >> To unsubscribe from this group and stop receiving emails from it, send an >> email to [email protected]. >> To post to this group, send email to [email protected]. >> To view this discussion on the web visit >> https://groups.google.com/d/msgid/ansible-project/CA%2BnsWgz7M1gwcx1e_RDC8Vr%2BuAd96xS5rw5LZMn4AUjEofLmdw%40mail.gmail.com >> <https://groups.google.com/d/msgid/ansible-project/CA%2BnsWgz7M1gwcx1e_RDC8Vr%2BuAd96xS5rw5LZMn4AUjEofLmdw%40mail.gmail.com?utm_medium=email&utm_source=footer> >> . >> >> For more options, visit https://groups.google.com/d/optout. >> > > -- You received this message because you are subscribed to the Google Groups "Ansible Project" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To post to this group, send email to [email protected]. To view this discussion on the web visit https://groups.google.com/d/msgid/ansible-project/CAMFyvFh0f0qARoHPbga3iT8GwSzD6iUXdJeA5dezgja-00ajGg%40mail.gmail.com. For more options, visit https://groups.google.com/d/optout.
