[jira] [Comment Edited] (MESOS-1199) Subprocess is "slow" -> gated by process::reap poll interval

Craig Hansen-Sturm (JIRA) Fri, 01 Aug 2014 12:44:31 -0700

    [ 
https://issues.apache.org/jira/browse/MESOS-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14082780#comment-14082780
 ]


Craig Hansen-Sturm edited comment on MESOS-1199 at 8/1/14 7:42 PM:
-------------------------------------------------------------------

Completed testing. Created a new reaping test which collects notifications 
from N subprocesses after killing each child, while varying the reaping 
interval I.

Attached color-coded chart shows %CPU utilization when:

I = 2s,1s,500ms,250ms,125ms,63ms,31ms,16ms,8ms,4ms,2ms,1ms
N = 1,4,16,64,256

For example, with 64 child processes (orange curve) and a 250ms reaping 
interval, 21% of the CPU is used. At 16ms, the machine is completely 
saturated.

The attached chart (waitpid.pdf) demonstrates that no polling mechanism that 
blocks on each child pid (in succession) will scale to shorter polling 
intervals and higher process counts.
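
To make the scaling concern concrete, here is a minimal sketch of the polling pattern in Python. It is not the libprocess reaper: it assumes a non-blocking waitpid check per pending pid on every tick, and the names spawn_children and poll_reap are hypothetical. The point is only that the number of waitpid calls per tick grows with the number of pending children, so the total cost grows roughly as N / I.

```python
import os
import time


def spawn_children(n):
    """Fork n children that exit immediately with status 0."""
    pids = []
    for _ in range(n):
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # child: exit at once, skipping parent cleanup
        pids.append(pid)
    return pids


def poll_reap(pids, interval_s, timeout_s=5.0):
    """Check every pending pid once per tick until all are reaped.

    Returns (exit_codes, waitpid_calls). Assumes the children exit
    normally, so WEXITSTATUS is meaningful.
    """
    pending = set(pids)
    exit_codes = {}
    calls = 0
    deadline = time.monotonic() + timeout_s
    while pending and time.monotonic() < deadline:
        for pid in list(pending):
            calls += 1  # one system call per pending child per tick
            reaped, status = os.waitpid(pid, os.WNOHANG)
            if reaped == pid:
                exit_codes[pid] = os.WEXITSTATUS(status)
                pending.discard(pid)
        if pending:
            time.sleep(interval_s)  # the reaping interval I
    return exit_codes, calls


if __name__ == "__main__":
    codes, calls = poll_reap(spawn_children(16), interval_s=0.25)
    print(len(codes), "children reaped with", calls, "waitpid calls")
```

A notification-driven design would instead be woken once per exit rather than scanning every pid on every tick.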

That said, I believe this chart demonstrates that we can safely lower the 
reaping interval to 500ms or 250ms, provided the child process count is <= 64.

I recommend that we immediately do this; however, we still need a general 
non-blocking notification mechanism which scales linearly.   

Assuming there is consensus on this, I would like to create a separate JIRA 
which addresses lowering the polling interval, and keep this one open for the 
longer-term solution.

Opinions?



was (Author: craig-mesos):
Completed testing.    Created a new reaping test which collects N-subprocess 
notifications after killing each child, while varying the reaping interval I.

Attached color-coded chart shows %CPU utilization when:

I = 2s,1s,500ms,250ms,125ms,63ms,31ms,16ms,8ms,4ms,2ms,1ms
N = 1,4,16,64,256

For example, with 64 child processes (orange curve) and a 250ms reaping 
interval, 21% of the CPU is used.   At 16ms, the machine is completely 
saturated.

This chart demonstrates that no polling mechanism which blocks on each child 
pid (in succession) will scale at lower polling intervals and higher process 
counts.

That said, I believe this chart demonstrates that we can safely lower the time 
to 500 or 250ms if child process count <=64.

I recommend that we immediately do this; however, we still need a general 
non-blocking notification mechanism which scales linearly.   

Assuming there is consensus on this, I would like to create a separate JIRA 
which addresses lowering the polling interval, and keep this one open for the 
longer-term solution.

Opinions?


> Subprocess is "slow" -> gated by process::reap poll interval
> ------------------------------------------------------------
>
>                 Key: MESOS-1199
>                 URL: https://issues.apache.org/jira/browse/MESOS-1199
>             Project: Mesos
>          Issue Type: Improvement
>    Affects Versions: 0.18.0
>            Reporter: Ian Downes
>            Assignee: Craig Hansen-Sturm
>         Attachments: wiatpid.pdf
>
>
> Subprocess uses process::reap to wait on the subprocess pid and set the exit 
> status. However, process::reap polls with a one second interval, resulting 
> in a delay of up to the interval duration before the status future is set.
> This means if you need to wait for the subprocess to complete you get hit 
> with E(delay) = 0.5 seconds, independent of the execution time. For example, 
> the MesosContainerizer uses mesos-fetcher in a Subprocess to fetch the 
> executor during launch. At Twitter we fetch a local file, i.e., a very fast 
> operation, but the launch is blocked until the mesos-fetcher pid is reaped -> 
> adding 0 to 1 seconds for every launch!
> The problem is even worse with a chain of short Subprocesses because after 
> the first Subprocess completes you'll be synchronized with the reap interval 
> and you'll see nearly the full interval before notification, i.e., 10 
> Subprocesses each of << 1 second duration will take ~10 seconds!
> This has become particularly apparent in some new tests I'm working on where 
> test durations are now greatly extended with each taking several seconds.
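
The timing argument in the description above can be checked with a small model. The sketch below is in Python and rests on stated assumptions: the reaper ticks at fixed multiples of the interval, an exit is reported at the first tick at or after it, and each subprocess in a chain starts the moment the previous one is reported. The function names, the 10ms duration, and the 300ms start offset are illustrative and are not taken from libprocess.

```python
def next_tick_ms(t_ms, interval_ms):
    """First reaper tick at or after t_ms, with ticks at 0, I, 2I, ..."""
    return -(-t_ms // interval_ms) * interval_ms  # integer ceiling division


def mean_delay_ms(interval_ms):
    """Mean notification delay over exit times 0..interval_ms-1 (uniform)."""
    delays = [next_tick_ms(t, interval_ms) - t for t in range(interval_ms)]
    return sum(delays) / len(delays)


def chain_total_ms(count, duration_ms, interval_ms, start_ms=0):
    """Wall-clock time for `count` subprocesses run back to back.

    Each subprocess starts when the previous exit is reported, so after
    the first one every start is aligned with a reaper tick.
    """
    now = start_ms
    for _ in range(count):
        exited = now + duration_ms
        now = next_tick_ms(exited, interval_ms)  # wait for the next poll
    return now - start_ms


if __name__ == "__main__":
    print(mean_delay_ms(1000))                         # 499.5
    print(chain_total_ms(10, 10, 1000, start_ms=300))  # 9700
    print(chain_total_ms(10, 10, 250, start_ms=300))   # 2450
```

Under these assumptions the mean delay for a one second interval is about 0.5 seconds, and ten 10ms subprocesses take 9.7 seconds, in line with the ~10 seconds quoted above. With a 250ms interval the same chain takes 2.45 seconds.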



--
This message was sent by Atlassian JIRA
(v6.2#6252)
