paul-rogers commented on PR #12818:
URL: https://github.com/apache/druid/pull/12818#issuecomment-1196168007

   @rohangarg, thanks for fixing this flaky test! Thanks also for your detailed 
analysis. From prior observations, and based on your analysis, here's my guess 
as to what was happening. We spin up the Docker cluster and then start tests. 
The current ITs don't check the status of the cluster before they run: they 
simply rely on the retry utility to handle any failures.
   
   I've noticed that, on a single machine, it can take significant time for the 
cluster to come up. The "new ITs" specifically check the health of each service 
before starting the tests, and I've seen that this can take a while.
   
   So, in the scenario you describe, it may be that the good historical comes 
up and serves the 50 queries before the bad one has had a chance to get itself 
organized. So, an alternative fix would have been to check the health of all 
services before running the tests.
   
   Still, that leaves the statistical nature of the failure detection, which is 
unappealing. The deterministic mechanism you implemented seems much better.
   
   One suggestion: please add your explanation of the new behavior to the test 
code itself as a comment. I had a hard time figuring out what the code was 
doing until I read your explanation, which cleared thing right up. Would be 
great to preserve that explanation for future readers of the code. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to