[
https://issues.apache.org/jira/browse/MESOS-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14082780#comment-14082780
]
Craig Hansen-Sturm edited comment on MESOS-1199 at 8/1/14 7:42 PM:
-------------------------------------------------------------------
Completed testing. Created a new reaping test which collects N-subprocess
notifications after killing each child, while varying the reaping interval I.
Attached color-coded chart shows %CPU utilization when:
I = 2s,1s,500ms,250ms,125ms,63ms,31ms,16ms,8ms,4ms,2ms,1ms
N = 1,4,16,64,256
For example, with 64 child processes (orange curve) and a 250ms reaping
interval, 21% of the CPU is used. At 16ms, the machine is completely
saturated.
The attached chart (waitpid.pdf) demonstrates that no polling mechanism which
blocks on each child pid (in succession) will scale at lower polling intervals
and higher process counts.
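For illustration only, the per-pid polling pattern discussed above can be sketched roughly as follows. This is a minimal Python sketch, not the libprocess code: it assumes the reaper checks each outstanding pid with a non-blocking waitpid once per interval, and the names spawn_children and poll_reap are hypothetical. It is POSIX-only because it relies on os.fork.

```python
import os
import signal
import time


def spawn_children(n, lifetime=60.0):
    """Fork n children that just sleep; returns their pids (hypothetical helper)."""
    pids = []
    for _ in range(n):
        pid = os.fork()
        if pid == 0:
            # Child: idle until killed, then exit without running parent cleanup.
            try:
                time.sleep(lifetime)
            finally:
                os._exit(0)
        pids.append(pid)
    return pids


def poll_reap(pids, interval):
    """Check every outstanding pid once per interval until all are reaped.

    Returns ({pid: exit code, or negative signal number}, waitpid call count).
    """
    pending = set(pids)
    statuses = {}
    calls = 0
    while pending:
        for pid in list(pending):
            calls += 1
            reaped, status = os.waitpid(pid, os.WNOHANG)
            if reaped == pid:
                if os.WIFEXITED(status):
                    statuses[pid] = os.WEXITSTATUS(status)
                else:
                    statuses[pid] = -os.WTERMSIG(status)
                pending.discard(pid)
        if pending:
            time.sleep(interval)
    return statuses, calls


if __name__ == "__main__":
    children = spawn_children(4)
    for child in children:
        os.kill(child, signal.SIGKILL)
    print(poll_reap(children, interval=0.25))
```

Each tick costs one waitpid call per outstanding child, so in this sketch the call rate grows roughly as N / I.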
That said, I believe this chart demonstrates that we can safely lower the
reaping interval to 500ms or 250ms when the child process count is <= 64.
I recommend that we immediately do this; however, we still need a general
non-blocking notification mechanism which scales linearly.
Assuming there is consensus on this, I would like to create a separate JIRA
which addresses lowering the polling interval, and keep this one open for the
longer-term solution.
Opinions?
> Subprocess is "slow" -> gated by process::reap poll interval
> ------------------------------------------------------------
>
> Key: MESOS-1199
> URL: https://issues.apache.org/jira/browse/MESOS-1199
> Project: Mesos
> Issue Type: Improvement
> Affects Versions: 0.18.0
> Reporter: Ian Downes
> Assignee: Craig Hansen-Sturm
> Attachments: wiatpid.pdf
>
>
> Subprocess uses process::reap to wait on the subprocess pid and set the exit
> status. However, process::reap polls with a one second interval resulting in
> a delay up to the interval duration before the status future is set.
> This means if you need to wait for the subprocess to complete you get hit
> with E(delay) = 0.5 seconds, independent of the execution time. For example,
> the MesosContainerizer uses mesos-fetcher in a Subprocess to fetch the
> executor during launch. At Twitter we fetch a local file, i.e., a very fast
> operation, but the launch is blocked until the mesos-fetcher pid is reaped ->
> adding 0 to 1 seconds for every launch!
> The problem is even worse with a chain of short Subprocesses because after
> the first Subprocess completes you'll be synchronized with the reap interval
> and you'll see nearly the full interval before notification, i.e., 10
> Subprocesses each of << 1 second duration will take ~10 seconds!
> This has become particularly apparent in some new tests I'm working on where
> test durations are now greatly extended with each taking several seconds.
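The arithmetic in the description can be checked with a small model. This is a sketch under assumed simplifications, not a measurement: reaper ticks fall at exact multiples of the interval, spawning costs nothing, and each subprocess starts the moment the previous exit is observed. The helper names are hypothetical.

```python
import math


def notify_time(exit_time, interval):
    """Time of the first reaper tick at or after exit_time (ticks at k * interval)."""
    return math.ceil(exit_time / interval) * interval


def chain_duration(start, durations, interval):
    """Wall-clock time for subprocesses run back to back, where each one starts
    only once the reaper has observed the previous one's exit."""
    now = start
    for duration in durations:
        now = notify_time(now + duration, interval)
    return now - start


def mean_delay(interval, samples=1000):
    """Average wait between exit and notification for evenly spread exit times."""
    total = 0.0
    for k in range(samples):
        exit_time = (k + 0.5) * interval / samples
        total += notify_time(exit_time, interval) - exit_time
    return total / samples


if __name__ == "__main__":
    # Ten 10 ms subprocesses with a 1 s reap interval, first one started at t = 0.3 s.
    print(chain_duration(0.3, [0.01] * 10, 1.0))
    # Expected notification delay for a single subprocess with a 1 s interval.
    print(mean_delay(1.0))
```

Under these assumptions the chain of ten takes about 9.7 seconds and the mean single-subprocess delay is about 0.5 seconds, in line with the ~10 second and E(delay) = 0.5 second figures quoted above.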