rohangarg commented on issue #12700: URL: https://github.com/apache/druid/issues/12700#issuecomment-1191304588
I looked at the failing IT and have some observations (including an explanation of the test):

1. The test group verifies the `RetryQueryRunner` feature for missing segments. The feature is needed in case a historical drops a segment after the broker has scheduled a query on it but before the historical processes that segment.
2. Within that feature, the flaky test verifies the case where retries on missing segments are disallowed but the client is OK with incomplete results.
3. For verification, the test launches a normal historical and a bad historical. The bad historical announces that it can serve all the segments but reports the testing segment as missing whenever it is queried. A testing query on a single-segment datasource is run 50 times, assuming that it will run at least once on each of the good and bad historicals. Finally, the test checks that, out of the 50 runs, at least one run gave incomplete results and at least one run gave complete results. Further, there shouldn't be any query failures, since the client is OK with incomplete results.
4. The test failure tagged above says that it didn't find a run where the results were incomplete.

I tried to reproduce it locally but couldn't, even after more than 50 automated runs. A possible reason for the flakiness is the setup of historicals during the tests, which also interacts with cloud storage to get pre-populated data. The test setup only verifies that the testing datasource is fully available. It might be the case that the good historical is set up fine while the bad historical hits an independent problem, which leaves the testing datasource fully available but fails the test, since there will be no incomplete results.

For a fix, I think instead of accommodating more things in the existing test, we can refactor the test as follows:

1. Remove the good historical setup from the query-retry tests.
2. The bad historical reports the segment as missing for the same query a configurable number of times. So, if the bad historical reports the segment missing N times, a retry count of N+1 should always make any query succeed with complete results.
3. Refactor the tests for tighter assertions based on the above two changes.

In case the existing setup of two historicals is supposed to verify something in the query-retry feature that I'm missing, we can explore more solutions that keep the current setup intact.
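To make the proposed refactor concrete, here is a minimal, language-neutral sketch in Python of a single bad historical that reports a segment as missing a configurable number of times per query, plus a simple retry loop around it. This is not Druid code: `FlakyHistorical`, `run_with_retries`, `max_retries`, and `allow_partial` are hypothetical names chosen for illustration, and the retry loop only approximates what `RetryQueryRunner` does.

```python
from dataclasses import dataclass, field


@dataclass
class FlakyHistorical:
    """Hypothetical stand-in for the bad historical in the proposal.

    For each query id, it reports the segment as missing for the first
    `misses_per_query` attempts and serves the segment afterwards.
    """
    misses_per_query: int
    attempts_by_query: dict = field(default_factory=dict)

    def query(self, query_id, segment_id):
        """Return (rows, missing_segments) for a single-segment query."""
        attempt = self.attempts_by_query.get(query_id, 0) + 1
        self.attempts_by_query[query_id] = attempt
        if attempt <= self.misses_per_query:
            return [], [segment_id]
        return ["row-from-" + segment_id], []


def run_with_retries(historical, query_id, segment_id, max_retries, allow_partial):
    """Query once, then retry while the segment is still reported missing.

    Returns (rows, complete). Raises RuntimeError if the segment is still
    missing after `max_retries` retries and partial results are not allowed.
    """
    rows, missing = historical.query(query_id, segment_id)
    retries = 0
    while missing and retries < max_retries:
        retries += 1
        rows, missing = historical.query(query_id, segment_id)
    if missing and not allow_partial:
        raise RuntimeError("segments still missing after retries: %s" % missing)
    return rows, not missing


N = 3  # assumed number of times the bad historical reports the segment missing

# With N+1 retries allowed, the query always ends with complete results.
rows, complete = run_with_retries(
    FlakyHistorical(N), "q1", "seg-0", max_retries=N + 1, allow_partial=False
)
assert complete and rows == ["row-from-seg-0"]

# With retries disallowed but partial results accepted, the query does not
# fail and deterministically returns incomplete results.
rows, complete = run_with_retries(
    FlakyHistorical(N), "q2", "seg-0", max_retries=0, allow_partial=True
)
assert not complete and rows == []
```

Because the bad historical's behavior is deterministic per query, each case can be asserted on a single run rather than relying on 50 runs landing on both historicals.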
