Hi.

I have a small webhook app that is kicked off by curl and runs 
ansible-playbook. I have noticed some weirdness where the webhook shows 
the Ansible run as completed successfully, but curl gets back a 500 or 
504. I have tried my best to debug this, but the furthest I could get was 
isolating the check_call that runs ansible-playbook as the problem.

My coworker noticed defunct ssh processes on the same machine, and they seem 
to coincide with ansible-playbook runs. They only go away after restarting 
the webhook app. Since I can't figure out why curl is failing with 
500/504, I thought I'd try to solve the defunct ssh problem in the hope 
that it is related.

Here is the ansible-playbook call:

    # (this runs inside the webhook's route handler; check_call comes from
    # subprocess, datetime from the datetime module, and workspace/logger
    # are defined elsewhere in the app)
    json_data = request.get_json(force=True)

    try:
        app_name = json_data['app_name']
        app_env = json_data['app_env']
    except KeyError:
        return 'Please specify app_name and app_env', 400

    play = '%s_%s' % (app_name, app_env)
    inventory = 'inventory/%s' % play
    tag = 'deploy'

    try:
        check_call(["ansible-playbook", "-i", inventory, 
"infra_{}.yml".format(play), "--tags", tag],
 cwd=workspace)
    except Exception as e:
        logger.exception(e)
        logger.info(datetime.now())
        return 'Failure. See logs for error.', 500
    else:
        logger.info(datetime.now())
        return 'Success!', 200
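
One thing I've been considering is swapping check_call for subprocess.run so 
I can capture ansible's output in the logs and check the return code myself 
instead of relying on an exception. Just a sketch of what I mean (assumes 
Python 3.5+ and the same inventory/play/tag/workspace variables as above):

    from subprocess import run, PIPE

    result = run(["ansible-playbook", "-i", inventory,
                  "infra_{}.yml".format(play), "--tags", tag],
                 cwd=workspace, stdout=PIPE, stderr=PIPE)

    # keep whatever ansible printed so the 500s are easier to debug
    logger.info(result.stdout.decode())
    if result.returncode != 0:
        logger.error(result.stderr.decode())
        return 'Failure. See logs for error.', 500
    logger.info(datetime.now())
    return 'Success!', 200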

It seems that some playbooks result in defunct ssh processes and some 
don't, and I can't spot any difference in how those playbooks use ssh, 
since they are all just running Docker containers. This is what I find 
immediately after a run that Ansible reports as successful but where curl 
fails with 500/504:

$ ps -ef | grep ssh

root     13925     1  0 Mar07 ?        00:00:14 /usr/sbin/sshd -D
root     17119 13925  0 18:58 ?        00:00:00 sshd: mmorris [priv]
mmorris  17163 17119  0 18:58 ?        00:00:00 sshd: mmorris@pts/1
root     17345 17243  0 19:07 ?        00:00:00 [ssh] <defunct>
root     17346 17243  0 19:07 ?        00:00:00 ssh: /root/.ansible/cp/ansible-ssh-52.4.115.46-22-root [mux]
root     17478 13925  0 19:09 ?        00:00:00 sshd: mmorris [priv]
mmorris  17521 17478  0 19:09 ?        00:00:00 sshd: mmorris@pts/3

And then, less than 30 seconds later, the Ansible-related mux process also 
turns defunct:

$ ps -ef | grep ssh
root     13925     1  0 Mar07 ?        00:00:14 /usr/sbin/sshd -D
root     17119 13925  0 18:58 ?        00:00:00 sshd: mmorris [priv]
mmorris  17163 17119  0 18:58 ?        00:00:00 sshd: mmorris@pts/1
root     17345 17243  0 19:07 ?        00:00:00 [ssh] <defunct>
root     17346 17243  0 19:07 ?        00:00:00 [ssh] <defunct>
root     17478 13925  0 19:09 ?        00:00:00 sshd: mmorris [priv]
mmorris  17521 17478  0 19:09 ?        00:00:00 sshd: mmorris@pts/3
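
I haven't dug into it much yet, but the defunct entries show PPID 17243, 
which doesn't appear in the grep above, so something like this (Linux-only, 
just a throwaway check) should tell me what that parent process actually is:

    # 17243 is the parent PID shown for the defunct ssh entries above
    with open('/proc/17243/cmdline', 'rb') as f:
        print(f.read().replace(b'\0', b' ').decode())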

This has been causing me headaches for a while now, as I have CI/CD runs 
failing even though the deploy itself in Ansible is successful. Any 
information or advice for figuring this out would be VERY much appreciated!

Maybe there is something I can do instead of just check_call so that 
whatever is going on with the ssh processes won't affect the exit code 
passed to the app?
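
Or maybe the better approach is to not run the playbook inside the request 
at all and hand it off to a background thread, so curl gets an immediate 
response and the exit code only goes to the log? Roughly what I have in mind 
(just a sketch; run_play and the 202 response are my own invention, and 
workspace/logger are the same as above):

    from threading import Thread

    def run_play(inventory, play, tag):
        # same call as before, but the result only goes to the log, so a
        # weird exit from check_call can't turn into a 500/504 for curl
        try:
            check_call(["ansible-playbook", "-i", inventory,
                        "infra_{}.yml".format(play), "--tags", tag],
                       cwd=workspace)
        except Exception as e:
            logger.exception(e)
        else:
            logger.info('playbook %s finished ok', play)

    Thread(target=run_play, args=(inventory, play, tag)).start()
    return 'Accepted, deploy running.', 202

The downside is that the caller no longer knows whether the deploy actually 
succeeded, which might defeat the purpose of the webhook, so I'm not sure 
this is the right trade-off.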
