Currently the health checker is just launched by the executor blindly. This
proposal now plumbs information about the task to the subprocess. We
already have a mechanism to do this: the DSL provided to users when
they compose their processes. In that case we compose a command line with
variables like './foo --port={{thermos.ports[http]}}'. When this string is
passed to thermos (the parts of the executor that actually launch the
process), the variables are interpolated and the command string is passed
to $SHELL which then runs the process.
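
As a rough illustration of that flow (not the thermos or pystachio
implementation), the sketch below uses a hypothetical `interpolate` helper
and a flat dictionary of assumed names and values to stand in for the real
namespaces:

```python
import re

# Hypothetical stand-in for the DSL's interpolation; the real code uses pystachio.
_REF = re.compile(r'\{\{\s*([^{}]+?)\s*\}\}')


def interpolate(cmdline, namespace):
  """Replace each {{name}} with namespace[name]; leave unknown names untouched."""
  def lookup(match):
    name = match.group(1)
    return str(namespace.get(name, match.group(0)))
  return _REF.sub(lookup, cmdline)


# Assumed values, for illustration only.
namespace = {'thermos.ports[http]': 31337, 'mesos.hostname': 'host1.example.com'}
resolved = interpolate('./foo --port={{thermos.ports[http]}}', namespace)
print(resolved)  # ./foo --port=31337
```

The resolved string is what would then be handed to $SHELL.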

This proposal diverges from the existing setup because now ports are passed
to the subprocess via a completely different method. This proposal is also
limited because only ports are passed to the subprocess. Why not other
information available about the task such as the cluster, role, or
hostname? We can currently pass that information to processes via the DSL
as well by having cmdlines like './process --hostname={{mesos.hostname}}'.

I think removing this inconsistency is good, because the DSL becomes easier
to use and there is less of a support burden when educating users about the

To implement this I see at least two approaches:

   - The complex method involves pulling the code from thermos (the code that
   runs a process, interpolates variables, etc) into the shell health checker
   directly. It's a bit complex but it is doable and it has the most code
   reuse. It's different from using the subprocess32 module, because instead
   of spawning a subprocess, thermos will fork off a 'runner' process which
   then creates the desired process as a subprocess, and the runner
   communicates state back to the caller by writing the process state to
   checkpoint files.
   - The alternative approach involves setting shell=True, and using
   the pystachio library (the implementation of the interpolation) to fill in
   the variables and then passing that string to subprocess32. This has two
   benefits: it is much simpler and this is already done elsewhere in the
   code. The drawbacks are that we have to use shell=True to prevent the
   argument issue you were describing, which was objected to for security
   reasons, and that it is a breaking change (although it seems Uber is the
   only user so we can skip over that).
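
To make the argument issue concrete, here is a minimal sketch, assuming a
POSIX system and Python 3's standard `subprocess` module in place of
subprocess32 (it accepts the same `timeout` parameter). The `env` command is
only a stand-in for /usr/bin/health_checker so the example can run anywhere:

```python
import shlex
import subprocess

# `env` stands in for the real health check script (an assumption for this sketch).
cmd = 'HTTP_PORT=123 RPC_PORT=234 env'

# Without a shell, the first token is treated as the program to execute,
# so 'HTTP_PORT=123' is looked up as an executable and the launch fails.
argv = shlex.split(cmd)
try:
  subprocess.check_call(argv, timeout=5)
  ran_without_shell = True
except OSError:
  ran_without_shell = False

# With shell=True the string goes to /bin/sh, which applies the leading
# assignments to the environment of the command that follows them.
lines = subprocess.check_output(cmd, shell=True, timeout=5).decode().splitlines()
saw_ports = 'HTTP_PORT=123' in lines and 'RPC_PORT=234' in lines
print(ran_without_shell, saw_ports)
```

In the health checker the string passed with shell=True would be the
interpolated command rather than a literal one.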

I think the latter approach is preferable and I don't see it being a lot of
extra code. I think we can ignore the security implications from shell=True
for two reasons:

   - The documentation says "[...], unsanitized input from an untrusted
   source, [...]" which is not true here. Aurora assumes that the cmd strings
   provided in the DSL come from the user, who is authenticated by the
   scheduler when submitting the task.
   - The other process strings consumed by the DSL are passed to thermos
   which currently passes them to $SHELL, so we are already open to this kind of

To implement the latter approach you can take a look at
`DistributedCommandRunner` in
`src/main/python/apache/aurora/client/api/` which supports
running a command in the context of an already running task and does
variable interpolation.

Specifically you can see this code:
  def substitute(cls, command, task, cluster, **kw):
    prefix_command = 'cd %s;' % cls.thermos_sandbox(cluster, **kw)
    thermos_namespace = ThermosContext(...)
    mesos_namespace = MesosContext(instance=task.assignedTask.instanceId)
    command = String(prefix_command + command) % Environment(...)
    return command.get()

This code takes a command string, the task object and other parameters and
uses the pystachio library to interpolate the values into the string.

I don't think constructing the ThermosContext/MesosContext objects is too
hard in the shell health checker and the interpolation is only two extra

I hope this clears up what I think should be done to this code.

On Tue, Mar 8, 2016 at 3:12 PM, Dmitriy Shirchenko wrote:

> Ok, sorry, so what exactly are you proposing I do instead of simply
> passing in environment variables?
> What thermos code? What do we replace subprocess with?
> I'm new to this code base so I'm still figuring things out so pardon my
> ignorance.
> Do you have strong concerns about this approach or just want it to be
> perfect?
> On Tue, Mar 8, 2016, 12:04 PM Zameer Manji wrote:
>> On March 8th, 2016, 11:25 a.m. PST, *Zameer Manji* wrote:
>> The code for this approach looks fine to me, but I'm not sure if this 
>> approach is the way to go.
>> Why can't the command for the health checker include 
>> '{{thermos.ports[http]}}' and we can resolve that value before launching the 
>> subprocess? That's more consistent with the rest of the DSL. Further, using 
>> the mustache variables in the command variable would allow the health 
>> checker process to have access to all of the same information that task 
>> processes have like hostname.
>> For example the command could be '/usr/bin/health_checker 
>> --port-to-check={{thermos.ports[http]}}'.
>> On March 8th, 2016, 11:29 a.m. PST, *Joshua Cohen* wrote:
>> I think this is an excellent point, good catch Zameer!
>> Dmitriy, is there any reason why this approach won't work for you guys?
>> On March 8th, 2016, 11:39 a.m. PST, *Dmitriy Shirchenko* wrote:
>> Yea, I tried using that approach at first. But we need to allocate 10 ports 
>> (mix of HTTP and another RPC protocol(s)) for some services, and our 
>> existing health check scripts at the moment are just simple bash scripts. 
>> Dealing with 10 arguments will become difficult (imagine writing one for 
>> that case), especially since order will matter and the owner will need to 
>> keep track of order in which they are passed in (eg is it an HTTP one, or 
>> some RPC protocol.. oh wait, I thought that was passed in first... dammit I 
>> have to read code in how our internal system massages them since we bypass 
>> aurora client completely to see how it actually works). Environment 
>> variables are just easier to deal with.
>> Does this make sense? Perhaps I could have explained why I chose this
>> approach in the Summary/Description.
>> On March 8th, 2016, 11:42 a.m. PST, *Zameer Manji* wrote:
>> Using the DSL doesn't mean arguments, you can do something like this:
>> 'HTTP_PORT={{thermos.ports[http]}} RPC_PORT={{thermos.ports[rpc]}} 
>> /usr/bin/health_checker'
>> On March 8th, 2016, 11:49 a.m. PST, *Dmitriy Shirchenko* wrote:
>> Here's the code that runs the command:
>>     cmd = shlex.split(self.cmd)
>>     try:
>>       subprocess.check_call(cmd, timeout=self.timeout_secs)
>> so if self.cmd is 'HTTP_PORT=123 RPC_PORT=234 /usr/bin/health_checker'
>> as you are suggesting, how would this work? check_call doesn't pass
>> through environment variables w/out shell=True AFAIK. shell=True has
>> security concerns so it's disabled by default.
>> Can you elaborate on the security concerns?
>> Currently all of the other processes launched by thermos have the cmd string 
>> passed to the shell, which is very convenient. I think doing that here would 
>> be useful as well. In fact, we could just replace the subprocess work here by 
>> just re-using the existing thermos code we have which will pass the cmd 
>> string to bash, interpolate variables and setuid, setgid so the health 
>> check process doesn't run as root.
>> - Zameer
>> On March 8th, 2016, 10:32 a.m. PST, Dmitriy Shirchenko wrote:
>> Review request for Aurora, John Sirois, Bill Farner, and Zameer Manji.
>> By Dmitriy Shirchenko.
>> *Updated March 8, 2016, 10:32 a.m.*
>> *Bugs: * AURORA-1622
>> *Repository: * aurora
>> Description
>> Exposing ports to shell health checkers
>> Testing
>> Unit and end to end test.
>> Diffs
>>    - NEWS (b84a94550f93691eba0220afedb2bb4d5e00e6bd)
>>    - docs/
>>    (10702ff4e700b6da7bdd7fd036de442be1eba45c)
>>    - src/main/python/apache/aurora/common/health_check/
>>    (890bf0c5d50d0022c044a37191a2e3145cc6340f)
>>    - src/main/python/apache/aurora/executor/common/
>>    (303972778baa04e9d7dd47fb208fe1427e779976)
>>    - src/test/python/apache/aurora/common/health_check/
>>    (84f717fbf724c11863b4980fd2740dc23fe1404e)
>>    - src/test/python/apache/aurora/executor/common/
>>    (9bebce8f5a26662f58075d7ce881a8bdacb2fe46)

Zameer Manji
