I also meant to ask if you could do a simple `ansible -m ping all` test
(before and after changing the ulimit settings), to see if you still see
slowness with that simple test or if it is directly related to
fact-gathering.
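
To make the before/after comparison easier to record, here is a minimal Python sketch that prints the limits the current process inherited. It is only an illustration: the thread mentions `ulimit -u` and `ulimit -f`, and the `-n` (open files) entry is my own assumption, added because each fork also holds pipes and sockets open.

```python
import resource

# ulimit flag -> resource constant. Only -u and -f come up in this thread;
# the -n entry is an assumption added for completeness.
LIMITS = {
    "-u max user processes": resource.RLIMIT_NPROC,
    "-f max file size (bytes)": resource.RLIMIT_FSIZE,
    "-n max open files": resource.RLIMIT_NOFILE,
}


def fmt(value):
    """Render a limit the way `ulimit` does for unlimited values."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return {label: (soft, hard)} for the limits listed above."""
    return {label: resource.getrlimit(res) for label, res in LIMITS.items()}


if __name__ == "__main__":
    for label, (soft, hard) in current_limits().items():
        print("%-26s soft=%s hard=%s" % (label, fmt(soft), fmt(hard)))
```

Limits are inherited per process, so this should be run from the same shell session that launches `ansible`, once before and once after the `ulimit` change.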

Thanks!

On Mon, Sep 29, 2014 at 10:28 AM, James Cammarata <[email protected]>
wrote:

> Hi Barry,
>
> One thing I did notice when testing your configuration was that, with my
> default ulimit settings, large -f settings were causing similar tracebacks
> and failures. In my case setting `ulimit -u 4096` (may also have to do
> `ulimit -f 4096`) resolved that issue. I noticed this when using the
> "ansible" command vs. "ansible-playbook", the latter of which may have been
> hiding the underlying issue.
>
> We are still looking into the host-key checking issue to see if we can
> replicate that.
>
> Thanks!
>
>
> On Wed, Sep 24, 2014 at 2:23 PM, Michael DeHaan <[email protected]>
> wrote:
>
>> Hmm, curious.
>>
>> Yeah there's not really any extra SSH debug detail in the above.
>>
>> The error in question occurs in two places - one when the pipe slams shut
>> for no good reason, and another when ssh exits with error 255 (aka unknown
>> error).
>>
>> We're still looking into the known hosts awareness question.
>>
>>
>>
>>
>>
>> On Wed, Sep 24, 2014 at 3:17 PM, Barry Morrison <[email protected]>
>> wrote:
>>
>>> Support Request #2904 has known_hosts file attached to it
>>>
>>> Hopefully this is the pertinent part from failed facts gathering:
>>>
>>> <server1020.prod.domain> ESTABLISH CONNECTION FOR USER: bmorriso
>>> <server1020.prod.domain> REMOTE_MODULE setup CHECKMODE=True
>>> <server1020.prod.domain> EXEC ['ssh', '-C', '-vvv', '-o',
>>> 'ControlMaster=auto', '-o', 'ControlPersist=5m', '-o',
>>> 'ControlPath=/home/bmorriso/.ansible/cp/ansible-ssh-%h-%p-%r', '-o',
>>> 'StrictHostKeyChecking=no', '-o', 'Port=3422', '-o',
>>> 'KbdInteractiveAuthentication=no', '-o',
>>> 'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey',
>>> '-o', 'PasswordAuthentication=no', '-o', 'ConnectTimeout=10',
>>> 'server1020.prod.domain', u"/bin/sh -c 'LANG=C LC_CTYPE=C /usr/bin/python'"]
>>> ok: [server88.prod.domain]
>>> <server3033.prod.domain> ESTABLISH CONNECTION FOR USER: bmorriso
>>> <server3033.prod.domain> REMOTE_MODULE setup CHECKMODE=True
>>> <server3033.prod.domain> EXEC ['ssh', '-C', '-vvv', '-o',
>>> 'ControlMaster=auto', '-o', 'ControlPersist=5m', '-o',
>>> 'ControlPath=/home/bmorriso/.ansible/cp/ansible-ssh-%h-%p-%r', '-o',
>>> 'StrictHostKeyChecking=no', '-o', 'Port=3422', '-o',
>>> 'KbdInteractiveAuthentication=no', '-o',
>>> 'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey',
>>> '-o', 'PasswordAuthentication=no', '-o', 'ConnectTimeout=10',
>>> 'server3033.prod.domain', u"/bin/sh -c 'LANG=C LC_CTYPE=C /usr/bin/python'"]
>>> fatal: [server4214.prod.domain] => SSH Error: data could not be sent to
>>> the remote host. Make sure this host can be reached over ssh
>>> <server1028.prod.domain> ESTABLISH CONNECTION FOR USER: bmorriso
>>> <server1028.prod.domain> REMOTE_MODULE setup CHECKMODE=True
>>> <server1028.prod.domain> EXEC ['ssh', '-C', '-vvv', '-o',
>>> 'ControlMaster=auto', '-o', 'ControlPersist=5m', '-o',
>>> 'ControlPath=/home/bmorriso/.ansible/cp/ansible-ssh-%h-%p-%r', '-o',
>>> 'StrictHostKeyChecking=no', '-o', 'Port=3422', '-o',
>>> 'KbdInteractiveAuthentication=no', '-o',
>>> 'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey',
>>> '-o', 'PasswordAuthentication=no', '-o', 'ConnectTimeout=10',
>>> 'server1028.prod.domain', u"/bin/sh -c 'LANG=C LC_CTYPE=C /usr/bin/python'"]
>>> fatal: [server1020.prod.domain] => SSH Error: data could not be sent to
>>> the remote host. Make sure this host can be reached over ssh
>>>
>>>
>>> specifically server1020.prod.domain
>>>
>>> And I was trying to get the tasks to fail as they had many times earlier
>>> -- nothing failed. Everything worked as expected and completed in 59s. I'm
>>> not convinced it's fixed, but it's behaving. I'll poke at it in the AM. It
>>> was too easy to reproduce earlier.
>>>
>>> On Tuesday, September 23, 2014 7:43:12 PM UTC-7, Michael DeHaan wrote:
>>>>
>>>>
>>>>
>>>> On Tue, Sep 23, 2014 at 10:29 PM, Barry Morrison <[email protected]>
>>>> wrote:
>>>>
>>>>> It is not consistent each attempt as far as which hosts fail. On one
>>>>> attempt a server will fail, on the next attempt the same server will not
>>>>> fail and if I attempt to gather facts manually after it failed, it is able
>>>>> to gather facts successfully each time.
>>>>>
>>>>> But here is a host that failed on the last attempt:
>>>>>
>>>>> /usr/bin/ansible server74.prod.domain -m ping -vvvv -c ssh
>>>>> <server74.prod.domain> ESTABLISH CONNECTION FOR USER: bmorriso
>>>>> <server74.prod.domain> REMOTE_MODULE ping
>>>>> <server74.prod.domain> EXEC ['ssh', '-C', '-vvv', '-o',
>>>>> 'ControlMaster=auto', '-o', 'ControlPersist=5m', '-o',
>>>>> 'ControlPath=/home/bmorriso/.ansible/cp/ansible-ssh-%h-%p-%r', '-o',
>>>>> 'StrictHostKeyChecking=no', '-o', 'Port=3422', '-o',
>>>>> 'KbdInteractiveAuthentication=no', '-o',
>>>>> 'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey',
>>>>> '-o', 'PasswordAuthentication=no', '-o', 'ConnectTimeout=10',
>>>>> 'server74.prod.domain', u"/bin/sh -c 'LANG=C LC_CTYPE=C /usr/bin/python'"]
>>>>> server74.prod.domain | success >> {
>>>>>     "changed": false,
>>>>>     "ping": "pong"
>>>>> }
>>>>>
>>>>
>>>> Ok so this one is successful and SSH debug levels are not helping.  I'm
>>>> going to need to see one that fails, unfortunately.  That may need a
>>>> capture from the long run....
>>>>
>>>>
>>>>>
>>>>>
>>>>> --forks was set to 50 and I saw ~35 hosts fail
>>>>> --forks set to 25, only 5 failed and it ran in 47s
>>>>> --forks set to 15, none failed and it ran in 53s
>>>>> --forks set to 20, none failed and it ran in 45s
>>>>>
>>>>> The above are all with host checking off.
>>>>>
>>>>> Here is another "twist". With --forks passed, if the fact gathering
>>>>> doesn't fail, a task will, and like fact gathering before, it's never the
>>>>> same task that fails.
>>>>>
>>>>
>>>>
>>>> This feels to me like there may be some problem keeping ControlPersist
>>>> sockets open.
>>>>
>>>> One thing to note is they do typically consume about 1MB per host,
>>>> though at -f 50 this shouldn't be a problem.
>>>>
>>>> Also that version of Ubuntu should be perfectly fine.
>>>>
>>>> I've occasionally heard of issues with network hardware in the way - a
>>>> particularly badly misconfigured switch clamping things down.
>>>>
>>>> In this particular case, once discovered, the user was soon managing
>>>> thousands of nodes at a time.
>>>>
>>>> Though it's hard to say.  More digging is definitely required.
>>>>
>>>>
>>>>
>>>>
>>>>>
>>>>> Task fails with: "ssh connection closed waiting for sudo or su
>>>>> password prompt"
>>>>>
>>>>>
>>>>> With host checking on
>>>>>
>>>>> --forks 25 = 10m
>>>>> --forks 50 = 10m
>>>>>
>>>>> FWIW, If I set forks: 50 in /etc/ansible/ansible.cfg -- it still acts
>>>>> as if it is set to 5, only when I pass --forks 50 in the command does it
>>>>> actually seem to run at 50.
>>>>>
>>>>
>>>>
>>>> This is curious.   Possibly a permissions issue on ansible.cfg keeping
>>>> it from being read, or the value sitting outside the right section.
>>>> [defaults] vs [default] or something is possible.
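
Both possibilities above (an unreadable file, or `forks` under the wrong section header) can be checked with a few lines of Python. This is a rough sketch using the standard `configparser` module, not Ansible's own config loader; the helper name `forks_settings` is made up for illustration.

```python
import configparser
import os


def forks_settings(cfg_text):
    """Return (section, value) for every `forks` key found in cfg_text."""
    # RawConfigParser skips '%' interpolation in unrelated values;
    # strict=False tolerates duplicate keys instead of raising.
    parser = configparser.RawConfigParser(strict=False)
    parser.read_string(cfg_text)
    return [(section, parser.get(section, "forks"))
            for section in parser.sections()
            if parser.has_option(section, "forks")]


if __name__ == "__main__":
    path = "/etc/ansible/ansible.cfg"  # the path mentioned in the thread
    if not os.access(path, os.R_OK):
        print("%s is missing or not readable by this user" % path)
    else:
        with open(path) as handle:
            found = forks_settings(handle.read())
        for section, value in found:
            note = "" if section == "defaults" else "  <- not under [defaults]"
            print("[%s] forks = %s%s" % (section, value, note))
        if not found:
            print("no forks key found in %s" % path)
```

It may also be worth checking for an `ANSIBLE_CONFIG` environment variable, an `ansible.cfg` in the working directory, or a `~/.ansible.cfg`, since any of those is read in preference to `/etc/ansible/ansible.cfg`.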
>>>>
>>>> If you can email us the file, I'd be interested in seeing it.
>>>>
>>>> Again, also interested in your known_hosts to try to see if we can tell
>>>> why it might not be detecting that your host is in the file.
>>>>
>>>> That SSH is asking shows it's there, but for some reason Ansible is
>>>> thinking it may need to ask you.
>>>>
>>>> Again, about 65-75% of our users are using these default options vs
>>>> paramiko - and we haven't heard this reported recently - so we hope to
>>>> get to the bottom of this.
>>>>
>>>> Help with the above questions and info would be greatly appreciated!
>>>>
>>>>
>>>>>
>>>>> Also, no "old" version of Ansible
>>>>>
>>>>> which ansible
>>>>> /usr/bin/ansible
>>>>>
>>>>> /usr/bin/ansible --version
>>>>> ansible 1.7.1
>>>>>
>>>>> Hope this helps, but fear it may add to the confusion.
>>>>>
>>>>> On Tuesday, September 23, 2014 6:39:47 PM UTC-7, Michael DeHaan wrote:
>>>>>>
>>>>>> "With it commented, no failures, I'm able to communicate with all
>>>>>> servers. "
>>>>>>
>>>>>> This part is a little interesting.
>>>>>>
>>>>>> Turning off host checking and going slow you can talk to all your
>>>>>> hosts.  Going fast you cannot?
>>>>>>
>>>>>> (If this is repeatable, I wonder if maybe you have an SSH jumphost
>>>>>> configured that might be getting overwhelmed?   Or perhaps something
>>>>>> similar on the network?)
>>>>>>
>>>>>> Can I ask what --forks is set to?
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Sep 23, 2014 at 9:36 PM, Michael DeHaan <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Ok Barry,
>>>>>>>
>>>>>>> We'll get you sorted before you wander off and lose a limb :)
>>>>>>>
>>>>>>> These things seem to be unrelated.
>>>>>>>
>>>>>>> (A)
>>>>>>>
>>>>>>> This has happened in the past when the host key of a host doesn't
>>>>>>> *appear* to Ansible's ssh.py connection type to be in the known hosts
>>>>>>> file, and it creates a serial lock to ask you the question about whether
>>>>>>> it should be added - but for whatever reason, knew it was actually there.
>>>>>>> The result of this is that --forks is not used on the first task per
>>>>>>> host, which makes things not be parallel.   It's frustrating.
>>>>>>>
>>>>>>> This was fixed long ago, when we added knowledge about hashed
>>>>>>> known_hosts entries, and should be quite good today, especially on a
>>>>>>> well tested OS like 14.04, basically at the top of our test matrix.
>>>>>>> Finding it again now is curious.
>>>>>>>
>>>>>>> I'd worry if something else might be interfering with the lock.
>>>>>>> My first question is if (maybe privately), we could see your known_hosts
>>>>>>> file?
>>>>>>>
>>>>>>> So we're not quite out of that territory yet with host key checking
>>>>>>> on, but I'm still curious about why it may still be doing that.
>>>>>>>
>>>>>>> There may be a slim chance you're actually using an older ansible
>>>>>>> version, or they are hashed weirdly for some reason.
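
For anyone wanting to check the hashing by hand, the sketch below tests whether a host matches a hashed known_hosts entry. It assumes OpenSSH's `HashKnownHosts` format as I understand it (`|1|<base64 salt>|<base64 HMAC-SHA1 of the host name>`, with the name written as `[host]:port` for non-default ports). It is not Ansible's ssh.py logic, and the function names are hypothetical.

```python
import base64
import hashlib
import hmac


def known_hosts_name(host, port=22):
    """Host name as stored in known_hosts; non-default ports are bracketed."""
    return host if port == 22 else "[%s]:%d" % (host, port)


def matches_hashed_entry(entry, host, port=22):
    """True if `entry` (first field of a known_hosts line) hashes `host`."""
    if not entry.startswith("|1|"):
        return False  # plain-text entry or unknown hash scheme
    parts = entry.split("|")  # ['', '1', salt, hash]
    if len(parts) != 4:
        return False
    salt = base64.b64decode(parts[2])
    expected = base64.b64decode(parts[3])
    # The salt is used as the HMAC key and the host name as the message.
    digest = hmac.new(salt, known_hosts_name(host, port).encode(),
                      hashlib.sha1).digest()
    return hmac.compare_digest(digest, expected)
```

Since the EXEC lines in this thread show `Port=3422`, a lookup by bare host name would miss an entry that was stored as `[host]:3422`. Whether that is what happens here is only a guess.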
>>>>>>>
>>>>>>> I'll assume this is happening with "-c ssh".
>>>>>>>
>>>>>>> (I'd also be curious if this happens on the development branch, but
>>>>>>> I don't anticipate any changes there)
>>>>>>>
>>>>>>> (B)
>>>>>>>
>>>>>>> On the second question, I'm expecting these 10 hosts are
>>>>>>> consistently doing that between runs, as in the same hosts?
>>>>>>>
>>>>>>> Can I get the result of running /usr/bin/ansible hostname -m ping -vvvv
>>>>>>> -c ssh against one of them?
>>>>>>>
>>>>>>> That will engage SSH debug mode and tell us a little more about what
>>>>>>> may be up.
>>>>>>>
>>>>>>> They could actually be down, but I'm guessing you checked that.
>>>>>>> That being returned extraneously is not expected.
>>>>>>>
>>>>>>> It could also be that ansible_ssh_port or something needs to be set
>>>>>>> in inventory or whatever, and it's not normally set, firewall issues, or
>>>>>>> things like that?
>>>>>>>
>>>>>>> Let's start with the "-vvvv" part.
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Sep 23, 2014 at 9:20 PM, Barry Morrison <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Oh, FWIW, I'm touching over 350 servers with this playbook and
>>>>>>>> gathering facts from all of them.
>>>>>>>>
>>>>>>>> On Tuesday, September 23, 2014 6:17:53 PM UTC-7, Barry Morrison
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Spawned from Conversation with Michael on Twitter
>>>>>>>>> https://twitter.com/esacteksab/status/514558427217936384
>>>>>>>>>
>>>>>>>>> Uncommenting host_key_checking = False, a playbook runs in 35s
>>>>>>>>> Commenting host_key_checking = False, the playbook runs in 9m25s
>>>>>>>>>
>>>>>>>>> But with it uncommented, ~10% of the servers return: "SSH Error:
>>>>>>>>> data could not be sent to the remote host. Make sure this host can be
>>>>>>>>> reached over ssh"
>>>>>>>>>
>>>>>>>>> With it commented, no failures, I'm able to communicate with all
>>>>>>>>> servers.
>>>>>>>>>
>>>>>>>>> This is a topic to troubleshoot further, because Twitter and
>>>>>>>>> 140 chars isn't all that great.
>>>>>>>>>
>>>>>>>>> Ansible is 1.7.1 on Ubuntu 14.04
>>>>>>>>> Servers are a combination of Ubuntu 12.04 and 14.04
>>>>>>>>>
>>>>>>>>  --
>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>> Groups "Ansible Project" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>> send an email to [email protected].
>>>>>>>> To post to this group, send email to [email protected].
>>>>>>>> To view this discussion on the web visit
>>>>>>>> https://groups.google.com/d/msgid/ansible-project/78fd3ef2-1b80-4167-b2f6-99d49569a177%40googlegroups.com.
>>>>>>>>
>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>>
>>
>
>

