[ 
https://issues.apache.org/jira/browse/DAFFODIL-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Beckerle closed DAFFODIL-1510.
-----------------------------------
      Assignee: Mike Beckerle
    Resolution: Not A Problem

This is an Owl-internal thing. Not part of Daffodil.

> Improve performance report with variance information
> ----------------------------------------------------
>
>                 Key: DAFFODIL-1510
>                 URL: https://issues.apache.org/jira/browse/DAFFODIL-1510
>             Project: Daffodil
>          Issue Type: Improvement
>          Components: Performance, QA
>            Reporter: Mike Beckerle
>            Assignee: Mike Beckerle
>            Priority: Major
>
> A big improvement for these reports would be to make them 
> "self-noise-eliminating", so unlike the report attached, one could eliminate 
> all the red-lights that are about deltas that are "in the noise".
> We want to attract attention to (i.e., red-light) deltas that represent a 
> statistically significant drop in performance. This can be a drop relative to 
> prior performance of this branch, or a drop relative to prior performance of 
> a baseline release.
> To do this you need variance-based statistics like Z-score, which is based on 
> standard deviation. Z-score means "how many standard deviations away from the 
> mean is this value." Z-scores between -1 and 1 imply "it's ordinary 
> variation, most likely due to noise". Z-scores outside of -1 to 1 imply 
> "it's significant, take a look."
> We need the mean and standard deviation of (previousVal - baselineVal). We 
> can then compute (currentVal - baselineVal), and if its z-score is < -1.0, 
> then we would red-light the value - it means there is a statistically 
> significant degradation in performance (relative to the baseline) due to this 
> commit's code changes. This would only red-light changes due to this code 
> commit. If a test performance is relatively unchanged day to day, but always 
> slow relative to the baseline, this would not red-light that day's delta.
> We probably also want to red-light if there is a general degradation in 
> performance even for tests that are running faster than the baseline, so we 
> would also want mean and standard deviation of previousVal, and similarly 
> red-light if the delta z-score (relative to previousVal) is < -1.0.
> And we want to red-light (or pink-light) tests that are simply slower than 
> the baseline by a statistically significant amount as an ongoing trend. So we 
> would include the currentVal in the mean and stdDev(previousVal), and for 
> mean and stdDev(previousVal - baselineVal). Like everything else here, the 
> assumption is these values are time taken, so lower is better/faster. If the 
> mean of previousVal-baselineVal is negative by more than the 
> stdDev(previousVal - baselineVal), then the trend is that this test is slower 
> than the baseline by a significant amount on an ongoing basis, so we should 
> "pink light" the test results. That particular day's run might or might not 
> have reflected a statistically significant improvement or degradation, but 
> the trend is still below the baseline by a statistically significant amount.
> This takes all the noise variability out of the color highlighting.
> Example:
> baseline is 200, previous is 150, current 139. Mean of prev-baseline is 175, 
> and std-dev of prev-baseline is 12.
> So current minus the mean of prev-baseline is 139 - 175 = -36. The z-score 
> of that is -36/12 = -3.0, which is < -1.0. So the red light goes on.
> Example 2:
> Current is 120. Mean of previous is 142, standard deviation of previous is 12.
> Delta from mean is -22. The z-score is -22/12 = -1.83, which is < -1.0, so we 
> red-light this because it represents a statistically significant drop in 
> performance from the average for that test.
> Example 3:
> Current is 120. Folding that into the mean and standard deviation of 
> (previous - baseline) gives a mean of -20 and a stdDev of 10. That means the 
> test is generally 20 
> units slower than the baseline. The z-score of -20 relative to stdDev 10 is 
> -2.0, so we would "pink light" the test, as generally being slower than the 
> baseline on an ongoing basis.
> The inverse of these, statistically significant improvements, could generate 
> a green light (or light green).
> To compute this you need at least 12 points of history so that you can have a 
> meaningful mean and standard deviation to compute from.
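
The rules in the description above could be sketched roughly as follows. This is only an illustration, not code from Owl or Daffodil: the names MIN_HISTORY, z_score, delta_light, and trend_light are hypothetical, the sample standard deviation is an assumed choice (the description does not say sample versus population), and the sign convention simply follows the examples, where a z-score below -1.0 turns on the red light.

```python
from statistics import mean, stdev

# Assumption: the description asks for at least 12 points of history.
MIN_HISTORY = 12


def z_score(value, history):
    """Return how many sample standard deviations `value` lies from the
    mean of `history`, or None if the history is too short or constant."""
    if len(history) < MIN_HISTORY:
        return None
    sd = stdev(history)
    if sd == 0:
        return None
    return (value - mean(history)) / sd


def delta_light(value, history, threshold=1.0):
    """Classify one run against its history: 'red' for a significant drop,
    'green' for a significant improvement, 'none' if it is in the noise.
    `value` and `history` can be raw values or (value - baseline) deltas."""
    z = z_score(value, history)
    if z is None:
        return "none"
    if z < -threshold:
        return "red"
    if z > threshold:
        return "green"
    return "none"


def trend_light(deltas, threshold=1.0):
    """Classify the ongoing trend of (value - baseline) deltas, with the
    current run already folded in: 'pink' if the mean is negative by more
    than one standard deviation, 'light-green' for the inverse."""
    if len(deltas) < MIN_HISTORY:
        return "none"
    sd = stdev(deltas)
    if sd == 0:
        return "none"
    ratio = mean(deltas) / sd
    if ratio < -threshold:
        return "pink"
    if ratio > threshold:
        return "light-green"
    return "none"
```

With fewer than 12 history points every function returns no highlight at all, which is one way to honor the minimum-history requirement; a real report might prefer to mark such tests as "insufficient data" instead.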



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
