[
https://issues.apache.org/jira/browse/AURORA-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064306#comment-16064306
]
Reza Motamedi commented on AURORA-1939:
---------------------------------------
The negative CPU values can be caused by a dead child process whose pid is
reused by a new descendant of the same parent. Let me explain how. First,
remember that the CPU time reported by psutil is the cumulative CPU time a
process has consumed so far.
Suppose that at {{t_0 = 10}} we have the following process tree forked inside
a thermos process.
{noformat}
__ p0
\_ p1
{noformat}
The total CPU time of the thermos process is calculated as the sum of the CPU
time of all the processes, i.e.,
{{Process(p_0).cpu_time + Process(p_1).cpu_time}}. For the sake of argument,
let's say 1 second in {{p_0}} and 5 seconds in {{p_1}}.
Now imagine that by the time the next sample is collected at {{t_1 = 20}},
{{p_1}} has died. However, {{p_0}} has forked a new child and, by coincidence,
the same **pid** is assigned to that child. Let's call this new child
{{p_1'}}. Assume an additional 1 second is spent in {{p_0}} and 1 second is
spent in {{p_1'}}. The current calculation leads to the following reported CPU
value.
(new_sample - old_sample) / (time difference).
((2 + 1) - (1 + 5)) / 10 = -3/10.
While the current calculation discards the old processes that have died
during the last time interval, it does the lookup only by pid, so {{p_1'}} is
mistaken for {{p_1}}. If we correctly distinguish {{p_1}} from {{p_1'}}, the
math leads to:
(new_sample - old_sample) / (time difference).
((2 + 1) - 1) / 10 = 2/10.
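
To make the arithmetic above concrete, here is a minimal Python sketch. It is not the actual Thermos collector code: {{cpu_rate}}, the pids 100 and 101, and the creation times are all made up for illustration. It only assumes that each sample maps a process key to its cumulative CPU time. Keying by pid alone reproduces the negative value. Keying by the pair (pid, creation time) is one possible way to tell {{p_1}} and {{p_1'}} apart; psutil exposes the creation time through {{Process.create_time()}}.

```python
def cpu_rate(old_sample, new_sample, interval):
    """Return CPU seconds used per wall-clock second between two samples.

    Each sample maps a process key to its cumulative CPU time in seconds.
    Keys missing from old_sample count as new processes (baseline 0.0);
    keys missing from new_sample count as dead processes and are ignored.
    """
    used = sum(cpu - old_sample.get(key, 0.0) for key, cpu in new_sample.items())
    return used / interval


# Hypothetical pids: p_0 is 100, and p_1 / p_1' both get 101.
# Keyed by pid only: p_1' is mistaken for p_1.
old_by_pid = {100: 1.0, 101: 5.0}
new_by_pid = {100: 2.0, 101: 1.0}
rate_by_pid = cpu_rate(old_by_pid, new_by_pid, interval=10)

# Keyed by (pid, creation time): p_1' has a different key than p_1,
# so its baseline is 0.0 and the dead p_1 is dropped.
old_by_identity = {(100, 0.0): 1.0, (101, 2.0): 5.0}
new_by_identity = {(100, 0.0): 2.0, (101, 15.0): 1.0}
rate_by_identity = cpu_rate(old_by_identity, new_by_identity, interval=10)

print(rate_by_pid, rate_by_identity)
```

With these numbers, {{rate_by_pid}} comes out as -3/10 and {{rate_by_identity}} as 2/10, matching the two calculations above.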
---
I'm going to confirm that reused pids are the reason behind the negative CPU
times.
> Thermos landing (host) page reports incorrect CPU rates when it is busy
> -----------------------------------------------------------------------
>
> Key: AURORA-1939
> URL: https://issues.apache.org/jira/browse/AURORA-1939
> Project: Aurora
> Issue Type: Bug
> Reporter: Reza Motamedi
> Priority: Minor
>
> Thermos Observer uses `psutil` to monitor resource consumption of Thermos
> Processes. On a busy machine, I have noticed negative CPU values when
> visiting the Thermos landing page.
> In my test I reproduced this by starting many processes that constantly
> create short-lived children. This indicates that, between the time
> `process_collector_psutil` looks up the Process children and the time it
> calculates the CPU time, the pid of a child is reused by another, much
> younger process, which leads to negative CPU times.