rohangarg commented on PR #12818:
URL: https://github.com/apache/druid/pull/12818#issuecomment-1198137176

   @paul-rogers : Thanks a lot for review and the feedback! 
   
   > We spin up the Docker cluster and then start tests. The current ITs don't 
check the status of the cluster before they run: they simply rely on the retry 
utility to handle any failures.
   I've noticed that, on a single machine, it can take significant time for the 
cluster to come up. The "new ITs" specifically check the health of each service 
before starting the tests, and I've seen that this can take a while.
   So, in the scenario you describe, it may be that the good historical comes 
up and serves the 50 queries before the bad one has had a chance to get itself 
organized.
   
   Yes, that's what my conjecture was too regarding the flakiness. The current 
test only has some application level checks which acts as a proxy for test 
setup valdiation too, which seems a bit shaky. The delay in test setups due to 
load on CI is probably the reason I can't reproduce it locally. Great to know 
that the new framework has first class support for test setup check!
   
   > So, an alternative fix would have been to check the health of all services 
before running the tests.
   Still, that leaves the statistical nature of the failure detection, which is 
unappealing. The deterministic mechanism you implemented seems much better.
   
   Yes, I had thought of that approach but didn't move in that direction unless 
we have a good reason. Seemed better to me to make the test verify only the 
query-retry mechanism on the bad historical.
   
   > One suggestion: please add your explanation of the new behavior to the 
test code itself as a comment. I had a hard time figuring out what the code was 
doing until I read your explanation, which cleared thing right up. Would be 
great to preserve that explanation for future readers of the code.
   
   Yes, totally agree with you. I have added more explanation to the existing 
code documentation in the test and also updated some of it which has become 
outdated after the re-write. Thanks for catching that! 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to