Julian>You seem to believe that the only purpose of a test is to test the
specific bug or change for which it was introduced

I'm afraid you seem to put words in my mouth.

Note: so far I have technical reasons to keep the test with reduced number
of lines while you don't.

Note2: original test included SQL with 100 or so lines. Somehow that
pleases you. I've reduced the test to 10 or so, and it pisses you off.
Could you please provide a technical clarification/justification on what
number of SQL lines is enough in that test?
I'm sure you can't, so could we just agree that you over-reacted on "commit
88f125541"?

You might ignore the below part, however you might learn a bit on
property-based testing and/or fuzzy-testing and/or performance testing.


Of course a test method can test zillions of scenarios "by accident".

However, each test should have some EXPECTED behaviour.
For instance:
a) "the number of HEP applied rules equals to 42 union all union all query"
b) "the number of HEP applied rules is less when DEPTH_FIRST order is used
for union all union all query"
b*) "the number of HEP applied rules is less when DEPTH_FIRST order is used
for ANY query"
c) "the number of HEP applied rules grows linearly as the number of unions
grows"
d) "planning time grows linearly as the number of unions grows"
e) "planning time fits in X seconds for query of size Y at CPU of Z GHz"
f) "Calcite does not format hard drive when execute union all union all
query"
g) "Calcite is able to sustain load for at least 1 week and avoid out of
memory, performance degradations, etc, etc"
h) "your pick"

I find risks a..g quite important for Calcite.

I'm sure testRuleApplyCount can cover cases a and b only, and for those
cases it does not require to process 1000 unions.

a is straight-forward. It does not require to have 1000'000 iterations in
order to ensure "a".
b is straight-forward as well as long as we are dealing with single SQL.

b*, c is not that trivial to prove since queries might have different size
and shape.
Property-based testing suits here: one can generate a query, apply two
different HEP programs and compare if DEPTH_FIRST produces result faster.
Current test verifies just a single SQL shape, however property-based test
could pick random shape and size, and it could validate new SQL every time.
Note: that randomization could still fit in 5 seconds, however those 5
seconds would be multiplied by the number of test-suite executions, so it
would result in quite good coverage.
Old code of testRuleApplyCount massaged just a single SQL again and again,
so it very unlikely to find a true bug.

CALCITE-2506
<https://issues.apache.org/jira/projects/CALCITE/issues/CALCITE-2506> is
and example of bug identified by fuzzing + property-based testing.

d and e

performance tests (e.g. measurement of planning duration) require more than
a single point.
You can't measure performance by executing a single test on a random
hardware.
One would likely require isolated hardware as there's no way to measure
responses time on a shared environment.
If planning sequence like in testRuleApplyCount is truly important, it is a
good candidate for inclusion to ubenchmark suite. There we could
use @Param({1, 10, 100, 1000}) numberOfUnions and see how response time
grows as number of unions is changed.


f) Calcite does not format hard drive...
The tests for unexpected behaviour are better be done via fuzzing.
That is one could craft a randomized SQL, pass it to Calcite (with
randomized set of activated rules) and see what happens.
OutOfMemory and/or StackOverflow would be a strong sign of a bug in Calcite.

Here's an example of such issues: CALCITE-2527
<https://issues.apache.org/jira/projects/CALCITE/issues/CALCITE-2527>,
CALCITE-2526
<https://issues.apache.org/jira/projects/CALCITE/issues/CALCITE-2526>

g) "Calcite is able to sustain load for at least 1 week and avoid out of
memory, performance degradations, etc, etc"

This is called endurance testing, and the way to test there is to run a
test for a week.
You can't emulate that in a unit-test that has to complete in a reasonable
time. "1 week" is not reasonable duration for unit test execution.

Vladimir

Reply via email to