[ 
https://issues.apache.org/jira/browse/MESOS-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14085088#comment-14085088
 ] 

Bernd Mathiske commented on MESOS-1199:
---------------------------------------

This is an attempt to make the best of what we have so far, using two 
threads/actors:

1. A separate thread that uses *blocking* waitpid(0) to wait for all task 
processes that are in the slave's process tree. (Details to be worked out 
here, still drafting.)
2. Use all the techniques discussed above for all those task processes that are 
not in the slave's process tree because the slave was restarted.

Do both simultaneously! At first, there are no processes in category 2. If the 
slave ever restarts, category 1 is initially empty. Thereafter, as new tasks 
are started, we may have both for a while. Eventually, either no processes are 
left in category 2, or the remainder is so long-running that a little delay 
does not matter for it.
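
A minimal sketch of the category 1 thread, in Python only for brevity (Mesos 
itself is C++). The helper names spawn, reap_until_no_children and run_demo are 
made up for illustration. It assumes the children are direct children of the 
calling process, and it passes pid -1 (any child) rather than 0, because 0 
only covers children in the caller's own process group.

```python
import os
import threading


def spawn(exit_code):
    """Fork a stand-in 'task' that exits at once with exit_code (hypothetical helper)."""
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)  # child process: never returns
    return pid


def reap_until_no_children(on_exit):
    """Category 1: block in waitpid(-1) and report each child as it terminates."""
    while True:
        try:
            pid, status = os.waitpid(-1, 0)  # blocks; there is no poll interval
        except ChildProcessError:
            return  # ECHILD: no children left to wait for
        on_exit(pid, status)


def run_demo():
    exited = {}

    def record(pid, status):
        exited[pid] = os.WEXITSTATUS(status) if os.WIFEXITED(status) else None

    # Fork the children first, then start the dedicated reaper thread.
    expected = {spawn(code): code for code in (0, 3, 7)}
    reaper = threading.Thread(target=reap_until_no_children, args=(record,))
    reaper.start()
    reaper.join()
    return expected, exited
```

A real slave would keep the thread alive and park it while there are no 
children, instead of returning on ECHILD as this demo does.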

How do we know whether a task is a current child process or one from before 
the restart? We can remember the recovered tasks in, say, a set or list and 
iterate over that for category 2. We exclude every task in that set from 
category 1.
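
The bookkeeping could look roughly like this. TaskTable and its methods are 
hypothetical names, not existing Mesos code, and pids stand in for whatever 
task handle the slave really uses.

```python
class TaskTable:
    """Tracks which tasks are waited on (category 1) and which are polled (category 2)."""

    def __init__(self, recovered_pids):
        # Category 2: tasks recovered after a slave restart; not our children.
        self.recovered = set(recovered_pids)
        # Category 1: tasks forked by the current slave process.
        self.children = set()

    def launched(self, pid):
        """Record a task started by the current slave process."""
        self.children.add(pid)

    def category(self, pid):
        # Membership in the recovered set always wins, so a recovered task
        # is never handed to the blocking waitpid thread.
        if pid in self.recovered:
            return 2
        if pid in self.children:
            return 1
        return None

    def to_poll(self):
        """Tasks that still need the slower polling techniques."""
        return sorted(self.recovered)

    def exited(self, pid):
        """Forget a task once its termination has been observed."""
        self.recovered.discard(pid)
        self.children.discard(pid)
```

Once to_poll() returns an empty list, only the blocking thread is needed.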

If this works, then relatively long delays are a temporary thing, not the norm.

I don't know whether waitpid(0) reports every terminating process one by one 
without fail when called successively, or whether it skips some. Above, I 
assume it does. The man page seems silent on the issue.
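
A quick experiment can check this on a given platform. As far as I know, POSIX 
keeps a terminated child around as a zombie until it has been waited for, so 
no child should be skipped as long as SIGCHLD is not ignored. The sketch below 
is Python for brevity and the function name is made up. It relies on forked 
children inheriting the parent's process group, which is what pid 0 selects.

```python
import os
import time


def reap_all_after_exit(n):
    """Fork n children that exit at once, then call waitpid(0) repeatedly."""
    pids = set()
    for i in range(n):
        pid = os.fork()
        if pid == 0:
            os._exit(i % 256)  # child process: never returns
        pids.add(pid)

    # Give the children time to exit, so that several have already
    # terminated before the first wait call.
    time.sleep(0.2)

    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(0, 0)  # children in our process group
        except ChildProcessError:
            break  # ECHILD: nothing left to report
        reaped.append(pid)
    return pids, reaped
```

If every forked pid shows up exactly once in the reaped list, successive calls 
did not skip anything in that run.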


> Subprocess is "slow" -> gated by process::reap poll interval
> ------------------------------------------------------------
>
>                 Key: MESOS-1199
>                 URL: https://issues.apache.org/jira/browse/MESOS-1199
>             Project: Mesos
>          Issue Type: Improvement
>    Affects Versions: 0.18.0
>            Reporter: Ian Downes
>            Assignee: Craig Hansen-Sturm
>         Attachments: wiatpid.pdf
>
>
> Subprocess uses process::reap to wait on the subprocess pid and set the exit 
> status. However, process::reap polls with a one second interval resulting in 
> a delay up to the interval duration before the status future is set.
> This means if you need to wait for the subprocess to complete you get hit 
> with E(delay) = 0.5 seconds, independent of the execution time. For example, 
> the MesosContainerizer uses mesos-fetcher in a Subprocess to fetch the 
> executor during launch. At Twitter we fetch a local file, i.e., a very fast 
> operation, but the launch is blocked until the mesos-fetcher pid is reaped -> 
> adding 0 to 1 seconds for every launch!
> The problem is even worse with a chain of short Subprocesses because after 
> the first Subprocess completes you'll be synchronized with the reap interval 
> and you'll see nearly the full interval before notification, i.e., 10 
> Subprocesses each of << 1 second duration will take ~10 seconds!
> This has become particularly apparent in some new tests I'm working on where 
> test durations are now greatly extended with each taking several seconds.
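
The arithmetic in the description can be reproduced with a small model. It 
assumes that the reaper polls at fixed ticks, i.e. at multiples of the 
interval, which is a simplification and not taken from the process::reap 
source. The function names are made up for illustration.

```python
import math


def notify_time(exit_time, interval=1.0):
    """Time of the first poll tick at or after exit_time (fixed-tick assumption)."""
    return math.ceil(exit_time / interval) * interval


def chain_total(durations, start=0.0, interval=1.0):
    """Wall time for a chain where each step starts when the previous one is reaped."""
    t = start
    for d in durations:
        # The step finishes at t + d, but is only noticed at the next tick.
        t = notify_time(t + d, interval)
    return t - start
```

With ten steps of 0.05 seconds each, chain_total([0.05] * 10) gives 10 seconds 
in this model: after the first reap, every step starts right on a tick and 
waits almost a full interval.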



--
This message was sent by Atlassian JIRA
(v6.2#6252)
