> Unit tests that fail consistently but only on one configuration should not 
> be removed/replaced until the replacement also catches the failure.

> along the way, people have decided a certain configuration deserves 
> additional testing and it has been done this way in lieu of any other more 
> efficient approach.

Totally agree with these sentiments as well as the framing of our current unit 
tests as "bad fuzz-tests thanks to non-determinism".

To me, this reinforces my stance on a "pre-commit vs. post-commit" approach to 
testing *with our current constraints:*
 • Test the default configuration on all supported JDKs pre-commit
 • Post-commit, treat *consistent* failures on non-default configurations as 
immediate interrupts to the author that introduced them
 • Pre-release, push for no consistent failures on any suite in any 
configuration, and no regressions in flaky tests from the prior release (in 
the ASF CI env).
I think there's value in having the non-default configurations, but I'm not 
convinced the benefits outweigh the costs *specifically in terms of pre-commit 
work* due to flakiness in the execution of the software env itself, not to 
mention hardware env variance on the ASF side today.

All that said, if we got to a world where we could run our JVM-based tests 
deterministically within the simulator, my intuition is that we'd see a lot of 
the test-specific, non-defect flakiness drastically reduced. In such a world 
I'd be in favor of running :allthethings: pre-commit, as we'd have *much* 
higher confidence that those failures were actually attributable to the author 
of whatever diff the test is run against.

On Fri, Dec 8, 2023, at 8:25 AM, Mick Semb Wever wrote:
> 
>>> I think everyone agrees here, but… these variations are still catching 
>>> failures, and until we have an improvement or replacement we do rely on 
>>> them. I'm not in favour of removing them until we have proof/confidence 
>>> that any replacement is catching the same failures. Especially oa, tries, 
>>> vnodes. (Not tries and offheap is being replaced with "latest", which will 
>>> be a valuable simplification.)
>> 
>> What kind of proof do you expect? I cannot imagine how we could prove that, 
>> because the ability to detect failures results from the randomness of those 
>> tests. That's why, when such a test fails, you usually cannot reproduce it 
>> easily.
> 
> 
> Unit tests that fail consistently but only on one configuration should not 
> be removed/replaced until the replacement also catches the failure.
>  
>> We could extrapolate that to: why do we only have those configurations? Why 
>> don't we test trie / oa + compression, or CDC, or the system memtable? 
> 
> 
> Because, along the way, people have decided a certain configuration deserves 
> additional testing and it has been done this way in lieu of any other more 
> efficient approach.
