Hi Barry,

One thing I did notice when testing your configuration was that, with my
default ulimit settings, large -f (forks) settings were causing similar
tracebacks and failures. In my case setting `ulimit -u 4096` (you may also
have to set `ulimit -f 4096`) resolved that issue. I noticed this when using
the "ansible" command vs. "ansible-playbook", the latter of which may have
been hiding the underlying issue.
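
If it is easier to check the limits from Python than from the shell, here is a minimal sketch using the standard `resource` module (Unix only). The helper names `current_limits` and `fmt` are made up for illustration and are not part of Ansible. The script only reads the limits behind `ulimit -u` and `ulimit -f` and changes nothing.

```python
import resource


def current_limits():
    """Return {label: (soft, hard)} for the limits behind `ulimit -u` and `ulimit -f`."""
    names = {
        # shell equivalent: ulimit -u
        "max_user_processes": resource.RLIMIT_NPROC,
        # shell equivalent: ulimit -f (the shell reports blocks, this is bytes)
        "max_file_size_bytes": resource.RLIMIT_FSIZE,
    }
    return {label: resource.getrlimit(res) for label, res in names.items()}


def fmt(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


if __name__ == "__main__":
    for label, (soft, hard) in current_limits().items():
        print(f"{label}: soft={fmt(soft)} hard={fmt(hard)}")
```

Run it in the same shell session that launches ansible, since limits are inherited per process and can differ between sessions.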

We are still looking into the host-key checking issue to see if we can
replicate that.

Thanks!


On Wed, Sep 24, 2014 at 2:23 PM, Michael DeHaan <[email protected]> wrote:

> Hmm, curious.
>
> Yeah there's not really any extra SSH debug detail in the above.
>
> The error in question occurs in two places - one when the pipe slams shut
> for no good reason, and another when ssh exits with error 255 (aka unknown
> error).
>
> We're still looking into the known hosts awareness question.
>
>
>
>
>
> On Wed, Sep 24, 2014 at 3:17 PM, Barry Morrison <[email protected]>
> wrote:
>
>> Support Request #2904 has known_hosts file attached to it
>>
>> Hopefully this is the pertinent part from failed facts gathering:
>>
>> <server1020.prod.domain> ESTABLISH CONNECTION FOR USER: bmorriso
>> <server1020.prod.domain> REMOTE_MODULE setup CHECKMODE=True
>> <server1020.prod.domain> EXEC ['ssh', '-C', '-vvv', '-o',
>> 'ControlMaster=auto', '-o', 'ControlPersist=5m', '-o',
>> 'ControlPath=/home/bmorriso/.ansible/cp/ansible-ssh-%h-%p-%r', '-o',
>> 'StrictHostKeyChecking=no', '-o', 'Port=3422', '-o',
>> 'KbdInteractiveAuthentication=no', '-o',
>> 'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey',
>> '-o', 'PasswordAuthentication=no', '-o', 'ConnectTimeout=10',
>> 'server1020.prod.domain', u"/bin/sh -c 'LANG=C LC_CTYPE=C /usr/bin/python'"]
>> ok: [server88.prod.domain]
>> <server3033.prod.domain> ESTABLISH CONNECTION FOR USER: bmorriso
>> <server3033.prod.domain> REMOTE_MODULE setup CHECKMODE=True
>> <server3033.prod.domain> EXEC ['ssh', '-C', '-vvv', '-o',
>> 'ControlMaster=auto', '-o', 'ControlPersist=5m', '-o',
>> 'ControlPath=/home/bmorriso/.ansible/cp/ansible-ssh-%h-%p-%r', '-o',
>> 'StrictHostKeyChecking=no', '-o', 'Port=3422', '-o',
>> 'KbdInteractiveAuthentication=no', '-o',
>> 'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey',
>> '-o', 'PasswordAuthentication=no', '-o', 'ConnectTimeout=10',
>> 'server3033.prod.domain', u"/bin/sh -c 'LANG=C LC_CTYPE=C /usr/bin/python'"]
>> fatal: [server4214.prod.domain] => SSH Error: data could not be sent to
>> the remote host. Make sure this host can be reached over ssh
>> <server1028.prod.domain> ESTABLISH CONNECTION FOR USER: bmorriso
>> <server1028.prod.domain> REMOTE_MODULE setup CHECKMODE=True
>> <server1028.prod.domain> EXEC ['ssh', '-C', '-vvv', '-o',
>> 'ControlMaster=auto', '-o', 'ControlPersist=5m', '-o',
>> 'ControlPath=/home/bmorriso/.ansible/cp/ansible-ssh-%h-%p-%r', '-o',
>> 'StrictHostKeyChecking=no', '-o', 'Port=3422', '-o',
>> 'KbdInteractiveAuthentication=no', '-o',
>> 'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey',
>> '-o', 'PasswordAuthentication=no', '-o', 'ConnectTimeout=10',
>> 'server1028.prod.domain', u"/bin/sh -c 'LANG=C LC_CTYPE=C /usr/bin/python'"]
>> fatal: [server1020.prod.domain] => SSH Error: data could not be sent to
>> the remote host. Make sure this host can be reached over ssh
>>
>>
>> specifically server1020.prod.domain
>>
>> And I was trying to get the tasks to fail as they had many times earlier
>> -- nothing failed. Everything worked as expected and completed in 59s. I'm
>> not convinced it's fixed, but it's behaving. I'll poke at it in the AM. It
>> was too easy to reproduce earlier.
>>
>> On Tuesday, September 23, 2014 7:43:12 PM UTC-7, Michael DeHaan wrote:
>>>
>>>
>>>
>>> On Tue, Sep 23, 2014 at 10:29 PM, Barry Morrison <[email protected]>
>>> wrote:
>>>
>>>> It is not consistent each attempt as far as which hosts fail. On one
>>>> attempt a server will fail, on the next attempt the same server will not
>>>> fail and if I attempt to gather facts manually after it failed, it is able
>>>> to gather facts successfully each time.
>>>>
>>>> But here is a host that failed on the last attempt:
>>>>
>>>> /usr/bin/ansible server74.prod.domain -m ping -vvvv -c ssh
>>>> <server74.prod.domain> ESTABLISH CONNECTION FOR USER: bmorriso
>>>> <server74.prod.domain> REMOTE_MODULE ping
>>>> <server74.prod.domain> EXEC ['ssh', '-C', '-vvv', '-o',
>>>> 'ControlMaster=auto', '-o', 'ControlPersist=5m', '-o',
>>>> 'ControlPath=/home/bmorriso/.ansible/cp/ansible-ssh-%h-%p-%r', '-o',
>>>> 'StrictHostKeyChecking=no', '-o', 'Port=3422', '-o',
>>>> 'KbdInteractiveAuthentication=no', '-o',
>>>> 'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey',
>>>> '-o', 'PasswordAuthentication=no', '-o', 'ConnectTimeout=10',
>>>> 'server74.prod.domain', u"/bin/sh -c 'LANG=C LC_CTYPE=C /usr/bin/python'"]
>>>> server74.prod.domain | success >> {
>>>>     "changed": false,
>>>>     "ping": "pong"
>>>> }
>>>>
>>>
>>> Ok so this one is successful and SSH debug levels are not helping.  I'm
>>> going to need to see one that fails, unfortunately.  That may need a
>>> capture from the long run....
>>>
>>>
>>>>
>>>>
>>>> --forks was set to 50 and I saw ~35 hosts fail
>>>> --forks set to 25, only 5 failed and it ran in 47s
>>>> --forks set to 15, none failed and it ran in 53s
>>>> --forks set to 20, none failed and it ran in 45s
>>>>
>>>> The above are all with host checking off.
>>>>
>>>> Here is another "twist". With --forks passed, if the fact gathering
>>>> doesn't fail, a task will, and like fact gathering before, it's never the
>>>> same task that fails.
>>>>
>>>
>>>
>>> This feels to me like there may be some problem keeping ControlPersist
>>> sockets open.
>>>
>>> One thing to note is they do typically consume about ~1MB per host,
>>> though at -f 50 this shouldn't be a problem.
>>>
>>> Also that version of Ubuntu should be perfectly fine.
>>>
>>> I've occasionally heard of issues with network hardware in the way - a
>>> particularly badly misconfigured switch clamping things down.
>>>
>>> In this particular case, once discovered, the user was soon managing
>>> thousands of nodes at a time.
>>>
>>> Though it's hard to say.  More digging is definitely required.
>>>
>>>
>>>
>>>
>>>>
>>>> Task fails with: "ssh connection closed waiting for sudo or su password
>>>> prompt"
>>>>
>>>>
>>>> With host checking on
>>>>
>>>> --forks 25 = 10m
>>>> --forks 50 = 10m
>>>>
>>>> FWIW, If I set forks: 50 in /etc/ansible/ansible.cfg -- it still acts
>>>> as if it is set to 5, only when I pass --forks 50 in the command does it
>>>> actually seem to run at 50.
>>>>
>>>
>>>
>>> This is curious.   Possibly a permissions issue on ansible.cfg keeping
>>> it from being read, or the value being outside the right section.
>>> [defaults] vs [default] or something is possible.
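
One way to check the section-name theory without eyeballing the file is sketched below, using Python's standard `configparser`. The function name `find_forks` and the sample text are hypothetical, chosen for illustration. This is not how Ansible itself loads its configuration. It only reports which section a `forks` key sits in.

```python
import configparser


def find_forks(cfg_text):
    """Return (section, value) pairs for every `forks` key in INI-style text."""
    # interpolation=None so literal '%' characters in values are left alone
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    hits = []
    for section in parser.sections():
        if parser.has_option(section, "forks"):
            hits.append((section, parser.get(section, "forks")))
    return hits


if __name__ == "__main__":
    # Hypothetical file contents with the section name misspelled
    sample = "[default]\nforks = 50\n"
    for section, value in find_forks(sample):
        status = "ok" if section == "defaults" else "expected [defaults]"
        print(f"[{section}] forks={value} ({status})")
```

To try it on the real file, pass the contents of /etc/ansible/ansible.cfg to `find_forks()`. A hit under any section other than [defaults] would be consistent with forks still behaving as if it were 5. An ansible.cfg in the current directory or a ~/.ansible.cfg is read in preference to the one in /etc/ansible, so those are worth checking the same way.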
>>>
>>> If you can email us the file, I'd be interested in seeing it.
>>>
>>> Again, also interested in your known_hosts to try to see if we can tell
>>> why it might not be detecting that your host is in the file.
>>>
>>> That SSH is asking shows it's there, but for some reason Ansible is
>>> thinking it may need to ask you.
>>>
>>> Again, about 65-75% of our users are using these default options vs
>>> paramiko - and we haven't heard this reported recently - so we hope to
>>> get to the bottom of this.
>>>
>>> Help with the above questions and info would be greatly appreciated!
>>>
>>>
>>>>
>>>> Also, no "old" version of Ansible
>>>>
>>>> which ansible
>>>> /usr/bin/ansible
>>>>
>>>> /usr/bin/ansible --version
>>>> ansible 1.7.1
>>>>
>>>> Hope this helps, but fear it may add to the confusion.
>>>>
>>>> On Tuesday, September 23, 2014 6:39:47 PM UTC-7, Michael DeHaan wrote:
>>>>>
>>>>> "With it commented, no failures, I'm able to communicate with all
>>>>> servers. "
>>>>>
>>>>> This part is a little interesting.
>>>>>
>>>>> Turning off host checking and going slow you can talk to all your
>>>>> hosts.  Going fast you cannot?
>>>>>
>>>>> (If this is repeatable, I wonder if maybe you have an SSH jumphost
>>>>> configured that might be getting overwhelmed?   Or perhaps something
>>>>> similar on the network?)
>>>>>
>>>>> Can I ask what --forks is set to?
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Sep 23, 2014 at 9:36 PM, Michael DeHaan <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Ok Barry,
>>>>>>
>>>>>> We'll get you sorted before you wander off and lose a limb :)
>>>>>>
>>>>>> These things seem to be unrelated.
>>>>>>
>>>>>> (A)
>>>>>>
>>>>>> This has happened in the past when the host key of a host doesn't
>>>>>> *appear* to Ansible's ssh.py connection type to be in the known hosts
>>>>>> file, and it creates a serial lock to ask you the question about whether
>>>>>> it should be added - but for whatever reason, knew it was actually there.
>>>>>> The result of this is that --forks is not used on the first task per
>>>>>> host, which makes things not be parallel.   It's frustrating.
>>>>>>
>>>>>> This was fixed long ago, when we added knowledge about hashed
>>>>>> known_hosts entries, and should be quite good today, especially on a well
>>>>>> tested OS like 14.04, basically at the top of our test matrix.   Finding
>>>>>> it again now is curious.
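
For reference, a hashed known_hosts host field has the form `|1|salt|hash`, where both parts are base64 and the hash is HMAC-SHA1 of the host name keyed with the salt. The sketch below checks whether such a field belongs to a given host and port. It is an illustration of the OpenSSH format only, not the code in Ansible's ssh.py. The function names and the fixed salt are hypothetical. Note that for a non-default port the hashed name is `[host]:port`, not the bare host name.

```python
import base64
import hashlib
import hmac


def known_hosts_name(host, port=22):
    """Name as OpenSSH records it: bare host on port 22, else "[host]:port"."""
    return host if port == 22 else f"[{host}]:{port}"


def hash_host(name, salt):
    """Build a hashed host field: |1|base64(salt)|base64(HMAC-SHA1(salt, name))."""
    digest = hmac.new(salt, name.encode("utf-8"), hashlib.sha1).digest()
    return "|1|{}|{}".format(
        base64.b64encode(salt).decode("ascii"),
        base64.b64encode(digest).decode("ascii"),
    )


def matches_hashed_entry(field, host, port=22):
    """True if a hashed known_hosts host field was generated for host and port."""
    parts = field.split("|")
    if len(parts) != 4 or parts[0] != "" or parts[1] != "1":
        return False  # not a hashed (|1|...) field
    salt = base64.b64decode(parts[2])
    expected = base64.b64decode(parts[3])
    name = known_hosts_name(host, port).encode("utf-8")
    actual = hmac.new(salt, name, hashlib.sha1).digest()
    return hmac.compare_digest(actual, expected)


if __name__ == "__main__":
    salt = b"0123456789abcdefghij"  # hypothetical fixed salt, for the demo only
    field = hash_host(known_hosts_name("server74.prod.domain", 3422), salt)
    print(matches_hashed_entry(field, "server74.prod.domain", 3422))
    print(matches_hashed_entry(field, "server74.prod.domain"))
```

The host name and port 3422 in the demo are taken from the debug output earlier in this thread. The first field of each line in a real known_hosts file can be passed to `matches_hashed_entry()` in the same way.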
>>>>>>
>>>>>> I'd worry if something else might be interfering with the lock.   My
>>>>>> first question is if (maybe privately), we could see your known_hosts
>>>>>> file?
>>>>>>
>>>>>> So we're not quite out of that territory yet with host key checking
>>>>>> on, but I'm still curious about why it may still be doing that.
>>>>>>
>>>>>> There may be a slim chance you're actually using an older ansible
>>>>>> version, or they are hashed weirdly for some reason.
>>>>>>
>>>>>> I'll assume this is happening with "-c ssh".
>>>>>>
>>>>>> (I'd also be curious if this happens on the development branch, but I
>>>>>> don't anticipate any changes there)
>>>>>>
>>>>>> (B)
>>>>>>
>>>>>> On the second question, I'm expecting these 10 hosts are consistently
>>>>>> doing that between runs, as in the same hosts?
>>>>>>
>>>>>> Can I get the result of an /usr/bin/ansible hostname -m ping -vvvv -c
>>>>>> ssh against one of them?
>>>>>>
>>>>>> That will engage SSH debug mode and tell us a little more about what
>>>>>> may be up.
>>>>>>
>>>>>> They could actually be down, but I'm guessing you checked that.
>>>>>> That being returned extraneously is not expected.
>>>>>>
>>>>>> It could also be that ansible_ssh_port or something needs to be set
>>>>>> in inventory or whatever, and it's not normally set, firewall issues, or
>>>>>> things like that?
>>>>>>
>>>>>> Let's start with the "-vvvv" part.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Sep 23, 2014 at 9:20 PM, Barry Morrison <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Oh, FWIW, I'm touching over 350 servers with this playbook and
>>>>>>> gathering facts from all of them.
>>>>>>>
>>>>>>> On Tuesday, September 23, 2014 6:17:53 PM UTC-7, Barry Morrison
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Spawned from Conversation with Michael on Twitter
>>>>>>>> https://twitter.com/esacteksab/status/514558427217936384
>>>>>>>>
>>>>>>>> Uncommenting host_key_checking = False, a playbook runs in 35s
>>>>>>>> Commenting host_key_checking = False, the playbook runs in 9m25s
>>>>>>>>
>>>>>>>> But with it uncommented, ~10% of the servers return: "SSH Error:
>>>>>>>> data could not be sent to the remote host. Make sure this host can be
>>>>>>>> reached over ssh"
>>>>>>>>
>>>>>>>> With it commented, no failures, I'm able to communicate with all
>>>>>>>> servers.
>>>>>>>>
>>>>>>>> This is a topic to troubleshoot further, because Twitter and
>>>>>>>> 140 chars isn't all that great.
>>>>>>>>
>>>>>>>> Ansible is 1.7.1 on Ubuntu 14.04
>>>>>>>> Servers are a combination of Ubuntu 12.04 and 14.04
>>>>>>>>
>>>>>>>  --
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "Ansible Project" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>> send an email to [email protected].
>>>>>>> To post to this group, send email to [email protected].
>>>>>>> To view this discussion on the web visit
>>>>>>> https://groups.google.com/d/msgid/ansible-project/78fd3ef2-
>>>>>>> 1b80-4167-b2f6-99d49569a177%40googlegroups.com
>>>>>>> <https://groups.google.com/d/msgid/ansible-project/78fd3ef2-1b80-4167-b2f6-99d49569a177%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>> .
>>>>>>>
>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>
>>>>>>
>>>>>>
>>>
>
