Nathan Hartman wrote:
On Wed, Jan 8, 2020 at 9:24 AM Daniel Shahaf <d...@daniel.shahaf.name> wrote:
> Julian Foad wrote on Wed, Jan 08, 2020 at 10:14:42 +0000:
> > The suggestion was that we should prefer the regression test suite to be
> > deterministic, [...]
> But unlike nearly all of the test suite, we don't actually know what the
> sequence of inputs _is_,

When developing/debugging, I made the test print each set of inputs, sometimes for every test case and sometimes just when a case failed. Some of that debug print code is still there, commented out.
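
To illustrate, the kind of debug print I mean is roughly the following; the function and parameter names here are invented for this sketch, not the actual identifiers in the test code:

    #include <stdio.h>

    /* Dump one generated pair of inputs, either for every case or only when
     * a case fails.  Purely illustrative: the real code prints the generated
     * rangelists in whatever form the test builds them. */
    static void
    dump_case(int case_no, const char *rangelist_a, const char *rangelist_b)
    {
      printf("case %d: merge '%s' into '%s'\n",
             case_no, rangelist_a, rangelist_b);
    }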

> and we have no reason to believe it provides good
> coverage of edge cases and rare combinations of inputs. For all we know, the
> 90000 iterations could all be testing the exact same, most common codepath over
> and over again.

I inspected the debug prints whizzing by, and noted that they included each of the different kinds of input case I expected (e.g. empty ranges, reversed ranges, duplicate ranges, overlapping ranges) and that the results included examples of the failure modes I expected.

Based on that observation, I altered parameters such as the sizes of the generated revision ranges and the repetition counts, until the test provided good coverage without excessive repetition. I aimed for "a few" (more than one, fewer than hundreds) occurrences of the same failure mode.

That doesn't guarantee coverage of every possible edge-case combination, of course, but it gives me confidence that it's an effective test. I admit it's an issue that this is not apparent to an observer. However, anyone needing to debug again can enable the debug prints and see it.

> My earlier thinking (though I didn't spell it out) was to move the
> PRNG code to a separate program, use it to generate a self-documenting
> C array of unique test cases [...] If we discover new failing inputs in the
> future, they can be added to the test suite. Then there is no guessing
> about what inputs are being tested.

Capturing and storing the generated cases as a fixed, explicit set could make some tasks slightly easier, such as reaching the required state in a debugger or extracting a particular case into a separate test.

What cases would you extract? One example of each currently known failure mode (and success mode) is probably useful; that's a start. I already made a separate test case for one failure mode, and it would be reasonable to do so for the others.
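
To make that concrete, a captured table of cases might look something like the following; the struct, the field names and the expected values are invented for illustration, not taken from the real test code:

    /* A fixed, self-documenting case table (invented names and values).
     * Each entry records the two input rangelists and the result we would
     * expect from merging them. */
    struct merge_case {
      const char *rangelist_a;  /* changes to merge in */
      const char *rangelist_b;  /* rangelist to merge into */
      const char *expected;     /* expected merged result */
    };

    static const struct merge_case captured_cases[] = {
      { "3-7",     "5-9", "3-9" },  /* overlapping ranges */
      { "",        "5-9", "5-9" },  /* empty input */
      { "3-7,3-7", "",    "3-7" },  /* duplicate ranges collapse */
    };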

But what about capturing a sufficient set of input combinations that should be tested because they *might* in future result in different behaviours? That's hard. We can try to guess some cases that look like they would be worth testing, which is what we do when writing traditional fixed tests. But the idea of "random" input generation is to overcome the tendency to fail to guess in advance all the combinations that might reveal a problem. Pseudo-random inputs are good at that.

A good, dedicated random-testing infrastructure would make it easy to generate a set of fixed tests based on observing random testing combined with, say, code coverage information. If we had such infrastructure, sure; but we don't, and I'm not seeing why doing that extraction and analysis manually would be worth the effort.

The point is, we get wide coverage of input combinations. The coverage is neither theoretically nor empirically proven to be complete, but I have demonstrated it is wide enough to be useful.

> Using a PRNG makes the test code harder to read and to maintain, makes it
> more difficult to diagnose FAILs, and increases the risk of a bug in the test
> code. What benefits are we getting that offset these risks and costs?
> > The way in which the PRNG algorithm is used is outside the PRNG's API promises.
> PRNGs make *no* promises on their outputs when used with a fixed seed. In
> light of this, what reason do we have to believe that the test achieves good
> coverage?

Used for test case generation, the PRNG only needs to generate a (fixed) sequence of numbers that has no significant correlation with the pattern in which those numbers are consumed when generating test cases. We could reasonably call it a "useful sequence generator" instead of a "PRNG", to remove the stigma attached to falling short of strong randomness guarantees.
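
For concreteness, what I have in mind is no more than something like this; the constants and names are illustrative, not the ones in the actual test code:

    #include <stdint.h>

    /* A minimal "useful sequence generator": a linear congruential
     * generator run from a fixed seed.  Illustrative only. */
    static uint32_t state = 42;

    static uint32_t
    next_value(void)
    {
      /* Common LCG constants (Numerical Recipes); any similar choice would
       * do for generating test inputs. */
      state = state * 1664525u + 1013904223u;
      return state;
    }

    /* Pick a revision number in [0, max) for a generated range end point. */
    static int
    pick_rev(int max)
    {
      return (int)(next_value() % (uint32_t)max);
    }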

It sounds like you feel there should be a much higher quality of randomness. I addressed coverage above.

> It's not necessary to use a PRNG to test a large/diverse set of inputs. You
> can achieve that with plain old combinatorics (e.g., define 300 rangelist variables
> and test various combinations of them). I'd actually expect that to have
> better coverage than a PRNG with a fixed seed.

I'd be mildly interested to see the result of that analysis. In practice, however, what we have now is good enough.

> In other words, replace the PRNG and leave all else as-is.
> Generating all permutations when picking 2 items is easy: [...]

I don't mind if someone wants to write the test code to generate cases a different way. I'm not seeing the need, myself.
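
If someone did want to, I imagine it would look something like this; the table of "interesting" rangelists and the names are invented, and the real test would call the rangelist-merge code in place of the print:

    #include <stdio.h>

    /* Illustrative sketch: exhaustively test every ordered pair drawn from a
     * fixed table of interesting rangelists, instead of generating the pairs
     * pseudo-randomly. */
    static const char *interesting_cases[] = {
      "",          /* empty rangelist */
      "3-7",       /* simple range */
      "3-7,5-9",   /* overlapping ranges */
      "3-7,3-7",   /* duplicate ranges */
      "9-12,1-4",  /* out-of-order ranges */
    };

    int main(void)
    {
      int n = sizeof(interesting_cases) / sizeof(interesting_cases[0]);
      int i, j;

      for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
          printf("test pair: merge '%s' into '%s'\n",
                 interesting_cases[i], interesting_cases[j]);
      return 0;
    }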

If there is something simple I can do that would satisfy your concerns, I would be happy to do it. That could be something like setting the seed to a time-based value instead of a fixed number, or swapping in another simple drop-in PRNG implementation, or some such. Let me know if there is something of that magnitude that would resolve the concerns; so far, I have not been able to work out what, if anything, that would be.
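
As a sketch of the first of those options (again with invented names): seed from the clock, and print the seed so that a failing run can still be reproduced:

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
      /* Seed from the clock instead of a fixed constant, and report it so
       * that a failing run can be reproduced by re-using the same seed. */
      unsigned int seed = (unsigned int) time(NULL);
      printf("random test seed: %u\n", seed);
      /* ... seed the generator and run the generated cases here ... */
      return 0;
    }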

- Julian
