[
https://issues.apache.org/jira/browse/DAFFODIL-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mike Beckerle closed DAFFODIL-1510.
-----------------------------------
Assignee: Mike Beckerle
Resolution: Not A Problem
This is an Owl-internal thing. Not part of Daffodil.
> Improve performance report with variance information
> ----------------------------------------------------
>
> Key: DAFFODIL-1510
> URL: https://issues.apache.org/jira/browse/DAFFODIL-1510
> Project: Daffodil
> Issue Type: Improvement
> Components: Performance, QA
> Reporter: Mike Beckerle
> Assignee: Mike Beckerle
> Priority: Major
>
> A big improvement for these reports would be to make them
> "self-noise-eliminating", so unlike the report attached, one could eliminate
> all the red-lights that are about deltas that are "in the noise".
> We want to attract attention to (i.e., red-light) deltas that represent a
> statistically significant drop in performance. This can be a drop relative to
> prior performance of this branch, or a drop relative to prior performance of
> a baseline release.
> To do this you need variance-based statistics like Z-score, which is based on
> standard deviation. Z-score means "how many standard deviations away from the
> mean is this value." Z-scores between -1 and 1 imply "it's ordinary
> variation, most likely due to noise". A Z-score outside of -1 to 1 implies
> "it's significant; take a look."
> We need the mean and standard deviation of (previousVal - baselineVal). We
> can then compute (currentVal - baselineVal), and if its z-score is < -1.0,
> then we would red-light the value - it means there is a statistically
> significant degradation in performance (relative to the baseline) due to this
> commit's code changes. This would only red-light changes due to this code
> commit. If a test's performance is relatively unchanged day to day, but always
> slow relative to the baseline, this would not red-light that day's delta.
> We probably also want to red-light if there is a general degradation in
> performance even for tests that are running faster than the baseline, so we
> would also want mean and standard deviation of previousVal, and similarly
> red-light if the delta z-score (relative to previousVal) is < -1.0.
> And we want to red-light (or pink-light) tests that are simply slower than
> the baseline by a statistically significant amount as an ongoing trend. So we
> would include the currentVal in the mean and stdDev(previousVal), and for
> mean and stdDev(previousVal - baselineVal). Like everything else here, the
> assumption is these values are time taken, so lower is better/faster. If the
> mean of previousVal-baselineVal is negative by more than the
> stdDev(previousVal - baselineVal), then the trend is that this test is slower
> than the baseline by a significant amount on an ongoing basis, so we should
> "pink light" the test results. That particular day's run might or might not
> have reflected a statistically significant improvement or degradation, but
> the trend is still below the baseline by a statistically significant amount.
> This takes all the noise variability out of the color highlighting.
> Example:
> baseline is 200, previous is 150, current 139. Mean of prev-baseline is 175,
> and std-dev of prev-baseline is 12.
> So, current - prev-baseline is -36. Z-score of that is -3.0 which is < -1.0.
> So red-light goes on.
> Example 2:
> Current is 120. Mean of previous is 142, standard deviation of previous is 12.
> Delta from mean is -22. The z-score is -22/12 = -1.83, which is < -1.0, so we
> red-light this because it represents a statistically significant drop in
> performance from the average for that test.
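The arithmetic in Example 2 can be checked with a minimal sketch. The function name `z_score` and the variable `red_light` are illustrative names, not part of any existing report tooling; the -1.0 threshold is the one given in the description.

```python
def z_score(current, mean, std_dev):
    """Return how many standard deviations `current` lies from `mean`."""
    if std_dev <= 0:
        raise ValueError("std_dev must be positive")
    return (current - mean) / std_dev


# Example 2 from the description: current 120, mean 142, std dev 12.
z = z_score(120, 142, 12)

# Red-light when the z-score falls below -1.0, as the description proposes.
red_light = z < -1.0

print(round(z, 2), red_light)  # -1.83 True
```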
> Example 3:
> Current is 120, folding that into mean and std deviation of (previous -
> baseline) gives mean -20 stdDev of 10. That means the test is generally 20
> units slower than the baseline. The z-score of -20 relative to stdDev 10 is
> -2.0, so we would "pink light" the test, as generally being slower than the
> baseline on an ongoing basis.
> The inverse of these, statistically significant improvements, could generate
> a green light (or light green).
> To compute this you need at least 12 points of history so that you can have a
> meaningful mean and standard deviation to compute from.
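The history requirement and the red/green rule could be combined roughly as follows. This is a sketch under assumptions, not the report's actual implementation: the names `MIN_HISTORY`, `history_stats`, and `classify` are hypothetical, the sample standard deviation (`statistics.stdev`) is an assumed choice since the description does not say sample or population, and the sign convention follows the description (z-score below -1.0 is red, above 1.0 is green).

```python
import statistics

MIN_HISTORY = 12  # minimum number of history points, per the description


def history_stats(history, min_points=MIN_HISTORY):
    """Return (mean, std_dev) of `history`, or None if it is too short."""
    if len(history) < min_points:
        return None
    # Assumption: sample standard deviation; the description does not specify.
    return statistics.mean(history), statistics.stdev(history)


def classify(current, history, threshold=1.0):
    """Label `current` as 'red', 'green', or 'neutral' against `history`."""
    stats = history_stats(history)
    if stats is None:
        return "insufficient-history"
    mean, std_dev = stats
    if std_dev == 0:
        # A z-score is undefined when the history has no variation.
        return "no-variance"
    z = (current - mean) / std_dev
    if z < -threshold:
        return "red"
    if z > threshold:
        return "green"
    return "neutral"
```

The same function could be applied to a series of (previousVal - baselineVal) deltas instead of raw values, since it only depends on the mean and standard deviation of whatever series it is given.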