rohangarg commented on issue #12700:
URL: https://github.com/apache/druid/issues/12700#issuecomment-1191304588

   I looked at the failing IT and have some observations (including an 
explanation of the test): 
   1. The test group verifies the `RetryQueryRunner` feature for missing 
segments. The feature is needed in case a historical drops a segment after the 
broker has scheduled a query on it but before the segment is processed. 
   2. Within that feature, the flaky test verifies the case where retries on 
missing segments are disallowed but the client is OK with incomplete results.
   3. For verification, the test launches a normal historical and a bad 
historical. The bad historical announces that it can serve all the segments but 
reports the testing segment as missing whenever it is queried. A testing query 
on a single-segment datasource is run 50 times, assuming that it will run at 
least once on both the good and the bad historical. Finally, the test checks 
that, out of the 50 runs, there is at least one run where the query gave 
incomplete results and at least one run where the query gave complete results. 
Further, there shouldn't be any query failures, since the client is OK with 
incomplete results.
   4. The test failure tagged above says that it didn't find a run where the 
results were incomplete. I tried to reproduce it locally but couldn't do so 
even after more than 50 automated runs. A possible reason for the flakiness is 
the setup of historicals during tests, which also interacts with cloud storage 
to get pre-populated data. The test setup only verifies that the testing 
datasource is fully available. It might be the case that the good historical 
gets set up fine while the bad historical faces an independent problem. The 
testing datasource would then still be fully available (served by the good 
historical alone), but the test would fail since there would be no incomplete 
results.
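
   To make the failure mode in point 4 concrete, here is a minimal sketch of 
the check described in point 3. The class `RunTally` and its methods are 
hypothetical names used only for illustration; this is not the actual test 
code in Druid.

```java
// Hypothetical model of the assertion described above, not Druid's test code.
final class RunTally {
    enum Outcome { COMPLETE, INCOMPLETE, FAILED }

    private int complete;
    private int incomplete;
    private int failed;

    // Record the outcome of a single query run.
    void record(Outcome outcome) {
        switch (outcome) {
            case COMPLETE:
                complete++;
                break;
            case INCOMPLETE:
                incomplete++;
                break;
            default:
                failed++;
                break;
        }
    }

    // The test expects no failures, at least one complete run
    // and at least one incomplete run.
    boolean satisfiesExpectation() {
        return failed == 0 && complete > 0 && incomplete > 0;
    }
}
```

   If the bad historical never ends up serving the segment, all 50 runs are 
recorded as `COMPLETE`, so `satisfiesExpectation()` returns false even though 
the datasource was reported as fully available.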
   
   For a fix, I think that instead of accommodating more things in the 
existing test, we can refactor the test as follows: 
   1. Remove the good historical setup from the query-retry tests
   2. The bad historical reports the segment as missing for the same query a 
configurable number of times. So, if the bad historical reports the segment 
missing N times, allowing N+1 retries should always make any query succeed with 
complete results.
   3. Refactor the tests for tighter assertions based on the above two changes
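
   The proposed refactoring could be sketched roughly as below. This is only 
an illustration under assumptions: `FlakyHistorical`, `RetryingClient` and 
`runWithRetries` are hypothetical names, and the retry loop is a simplified 
stand-in for what `RetryQueryRunner` does, not its actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the bad historical: it reports the segment as
// missing for the first `missingReports` attempts of each query.
final class FlakyHistorical {
    private final int missingReports;
    private final Map<String, Integer> attemptsByQuery = new HashMap<>();

    FlakyHistorical(int missingReports) {
        this.missingReports = missingReports;
    }

    // Returns true if the segment is served, false if it is reported missing.
    boolean serve(String queryId) {
        int attempts = attemptsByQuery.merge(queryId, 1, Integer::sum);
        return attempts > missingReports;
    }
}

// Hypothetical, simplified retry loop.
final class RetryingClient {
    // Makes one initial attempt plus up to maxRetries retries.
    // Returns true if the query got complete results.
    static boolean runWithRetries(FlakyHistorical historical, String queryId, int maxRetries) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            if (historical.serve(queryId)) {
                return true;
            }
        }
        return false;
    }
}
```

   Under this model, with N = 3 missing reports, a query allowed N+1 = 4 
retries always returns complete results, while a query with retries disabled 
(0 retries) always returns incomplete results. Both outcomes are deterministic, 
which is what allows the tighter assertions.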
   
   In case the existing setup of two historicals is supposed to verify 
something in the query-retry feature that I'm missing, we can explore more 
solutions that keep the current setup intact.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

