paul-rogers commented on PR #12818: URL: https://github.com/apache/druid/pull/12818#issuecomment-1196168007
@rohangarg, thanks for fixing this flaky test! Thanks also for your detailed analysis. From prior observations, and based on your analysis, here's my guess as to what was happening. We spin up the Docker cluster and then start tests. The current ITs don't check the status of the cluster before they run: they simply rely on the retry utility to handle any failures. I've noticed that, on a single machine, it can take significant time for the cluster to come up. The "new ITs" specifically check the health of each service before starting the tests, and I've seen that this can take a while. So, in the scenario you describe, it may be that the good historical comes up and serves the 50 queries before the bad one has had a chance to get itself organized. So, an alternative fix would have been to check the health of all services before running the tests. Still, that leaves the statistical nature of the failure detection, which is unappealing. The deterministic mechanism you implemented seems much better. One suggestion: please add your explanation of the new behavior to the test code itself as a comment. I had a hard time figuring out what the code was doing until I read your explanation, which cleared thing right up. Would be great to preserve that explanation for future readers of the code. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
