ocket8888 opened a new issue #5975:
URL: https://github.com/apache/trafficcontrol/issues/5975


   ## I'm submitting a ...
   -  bug report
   
   ## Traffic Control components affected ...
   -  Traffic Monitor (integration tests)
   
   ## Current behavior:
   The Traffic Monitor integration tests include a check of its 
`/api/bandwidth-kbps` API endpoint. This endpoint returns the sum of the 
bandwidths of the polled cache servers as of the last poll. The endpoint is not 
mocked; its data comes from the `testcaches/fakesrvr` tool that populates the 
mock ATS caches used by the tests. The bandwidth is calculated by dividing the 
difference between the current and last-measured values of a field returned by 
an astats (or stats_over_http) request by the amount of time that passed 
between polls, then multiplying by a constant of proportionality. Thus we have
   
   
![image](https://user-images.githubusercontent.com/6013378/123295382-e67b6800-d4d2-11eb-8c36-763a1b7a16ef.png)
   
   where `N` is the number of servers polled, <code>t<sub>n</sub></code> is the 
time elapsed between polls for cache `n`, <code>x<sub>n</sub></code> is the 
current value of the astats/stats field, <code>x<sub>n</sub><sup>′</sup></code> 
is the last-measured value of said field, and `k` is a proportionality constant.
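
   For illustration, that aggregation can be sketched as below. The type and function names are hypothetical stand-ins, not Traffic Monitor's actual code:

```go
package main

import "fmt"

// pollSample is a hypothetical representation of one cache server's
// astats/stats_over_http byte-counter field at two successive polls.
type pollSample struct {
	current  float64 // x_n: current value of the counted field
	previous float64 // x_n': value at the previous poll
	elapsedS float64 // t_n: seconds elapsed between the two polls
}

// totalKbps sketches the sum described above: for each polled cache,
// the per-second delta of the counter is scaled by the proportionality
// constant k, and the results are summed over all N caches.
func totalKbps(samples []pollSample, k float64) float64 {
	var sum float64
	for _, s := range samples {
		sum += k * (s.current - s.previous) / s.elapsedS
	}
	return sum
}

func main() {
	// Two caches polled 6 seconds apart, each counter grew by 150.
	samples := []pollSample{
		{current: 1150, previous: 1000, elapsedS: 6},
		{current: 2150, previous: 2000, elapsedS: 6},
	}
	fmt.Println(totalKbps(samples, 125)) // 2 * 125 * (150/6) = 6250
}
```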
   
   The field being measured is calculated from a "`/proc/net/dev` line", which 
essentially boils down to this: the field starts at zero, and every second the 
fake server adds to it a random amount drawn from a certain interval for each 
configured "remap". The interval itself is determined by the number of 
"remaps": its minimum is always hard-coded to 0, but the maximum for the 
`i`<sup>th</sup> "remap" <code>r<sub>i</sub></code> is given by 
   
   
![image](https://user-images.githubusercontent.com/6013378/123303471-10d12380-d4db-11eb-80c4-18a9c067e89f.png)
   
   where <code>N<sub>r</sub></code> is the total number of "remaps", giving the 
upper bound of the full addition per second to <code>x<sub>n</sub></code> as:
   
   
![image](https://user-images.githubusercontent.com/6013378/123304060-b6849280-d4db-11eb-9c5a-d5a5b1df67d6.png)
   
   The number of remaps used in the GitHub Actions tests is hard-coded to 2, so 
this can be simplified:
   
   
![image](https://user-images.githubusercontent.com/6013378/123303950-98b72d80-d4db-11eb-9c0b-8f7b4b8c909d.png)
   
   So basically, since `t` is in seconds, this adds a random value on `[0,50)` 
every second to the "outBytes" used to determine bandwidth. The polling 
interval for Traffic Monitor in these tests is 6 seconds, so we can reasonably 
approximate that <code>x<sub>n</sub><sup>′</sup>=x<sub>n</sub>(t-6)</code> and 
<code>t<sub>n</sub>=6</code> for all `n`. By the central limit theorem, 
averaging these uniform selections gives an approximately normal distribution 
with a lower bound of 0, an expected value of 24.5, and an upper bound of:
   
   
![image](https://user-images.githubusercontent.com/6013378/123315786-57c61580-d4e9-11eb-9dee-0b4d6bf794f6.png)
   
   ... since `N` is hard-coded to 2, and the proportionality constant `k` is 
the number of bytes in a kilobit, which is 125.
   
   The test in kbps_test.go checks that the received value is between 5000 and 
20000, which corresponds to an emitted rate between 20 and 49 (the upper bound 
of the check actually exceeds the upper bound of possible values). That puts 
the check's lower bound near the 25<sup>th</sup> percentile of a roughly normal 
distribution with a mean of 24.5 (source: 
https://www.wolframalpha.com/input/?i=normal+distribution+mean%3D24.5+standard+deviation%3D4.94),
 although I'm not good at finding standard deviations, so that might not be 
exactly right. The point is: there is a non-negligible probability that the 
test will simply fail at random.
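
   A back-of-the-envelope Monte Carlo supports this. The sketch below assumes the simplified model above (the combined per-second addition to each server's counter is uniform on `[0,50)`, 2 servers, 6-second polls, `k` = 125), which may differ slightly from the fake server's exact integer distribution:

```go
package main

import (
	"fmt"
	"math/rand"
)

// estimateFailureRate simulates the kbps check under a simplified model
// of the fake server: for each of 2 caches, every second for 6 seconds
// the byte counter grows by a uniform random amount on [0, 50). The
// reported value is k * delta / t summed over caches, with k = 125.
func estimateFailureRate(trials int, seed int64) float64 {
	r := rand.New(rand.NewSource(seed))
	const (
		servers  = 2
		pollSecs = 6
		k        = 125.0
	)
	failures := 0
	for i := 0; i < trials; i++ {
		var kbps float64
		for s := 0; s < servers; s++ {
			var delta float64
			for t := 0; t < pollSecs; t++ {
				delta += r.Float64() * 50
			}
			kbps += k * delta / pollSecs
		}
		// The range asserted by kbps_test.go.
		if kbps < 5000 || kbps > 20000 {
			failures++
		}
	}
	return float64(failures) / float64(trials)
}

func main() {
	fmt.Printf("estimated failure rate: %.3f\n", estimateFailureRate(200000, 1))
}
```

Under this simplified model the estimate comes out on the order of 10%, consistent with the claim that random failures are not rare.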
   
   ## Expected behavior:
   Tests should not assert ranges on highly random data. The test should decide 
what it is actually testing (marshalling the data? accurate reporting of known 
data?) and test exactly that, to avoid random failures.
   
   ## Minimal reproduction of the problem with instructions:
   Run the TM integration tests repeatedly; given enough runs, the 
`/api/bandwidth-kbps` check will eventually fail purely by chance.

