klarose opened a new pull request #12233: URL: https://github.com/apache/druid/pull/12233
Fixes #11520.

### Description

Kubernetes watches allow a client to efficiently process changes to resources. However, they have some idiosyncrasies. In particular, they can error out for various reasons, leading to what would normally be seen as an invalid result.

The Druid Kubernetes node discovery subsystem does not handle one such case properly: the watch can return an item with a null object, which leads to a null pointer exception. When this happens, the provider needs to restart the watch, because rerunning the watch from the same resource version leads to the same result: yet another null pointer exception.

This commit changes the provider to handle null objects by restarting the watch.

A clean alternative to this would be to change the provider to use an [Informer](https://github.com/kubernetes-client/java/blob/master/examples/examples-release-14/src/main/java/io/kubernetes/client/examples/InformerExample.java). I suspect this would simplify the code substantially while handling most, if not all, of the corner cases we could run into by using a bare watch. I don't quite have the time to undertake a large change like that, though, so I'm submitting this quick fix so that we can at least resolve the most common issue that seems to affect the Kubernetes provider.

<hr>

##### Key changed/added classes in this PR

* DefaultK8sApiClient: now propagates the null object and logs a warning when this happens.
* K8sDruidNodeDiscoveryProvider: handles the null from the watch by restarting it, and logs a warning.

<hr>

<!-- Check the items by putting "x" in the brackets for the done things. Not all of these items apply to every PR. Remove the items which are not done or not relevant to the PR. None of the items from the checklist below are strictly necessary, but it would be very helpful if you at least self-review the PR. -->

This PR has:

- [x] been self-reviewed.
- [x] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
- [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
- [x] been tested in a test Druid cluster.

Note on testing: I didn't add unit tests to DefaultK8sApiClient. The infrastructure to do so was not present, unfortunately, and I suspect adding it would be a large undertaking.

For testing in a cluster, I reproduced the issue using [microk8s](https://microk8s.io/). I then repeated the scenario with my fix applied, which showed that the error message no longer occurred in a tight loop and that discovery still worked (I restarted a pod, and the new one was discovered).
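
For reviewers, the restart behaviour described in this PR can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in the PR: `WatchRestartSketch`, `Event`, `WatchSource` and `consume` are hypothetical names made up for the example, and the fake watch in `main` only imitates a watch that returns a null object. The actual change lives in DefaultK8sApiClient and K8sDruidNodeDiscoveryProvider.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Illustrative sketch only; none of these types exist in Druid or the Kubernetes client.
class WatchRestartSketch {

    /** Hypothetical watch event. A null object models the error case from the description. */
    static final class Event {
        final String type;
        final String object;
        final String resourceVersion;

        Event(String type, String object, String resourceVersion) {
            this.type = type;
            this.object = object;
            this.resourceVersion = resourceVersion;
        }
    }

    /** Hypothetical watch opener. A null resourceVersion means "list again, then watch". */
    interface WatchSource {
        Iterator<Event> watch(String resourceVersion);
    }

    /** Processes events, restarting the watch from scratch whenever a null object shows up. */
    static List<String> consume(WatchSource source, int maxRestarts) {
        List<String> processed = new ArrayList<>();
        String resourceVersion = null;
        for (int restarts = 0; restarts <= maxRestarts; restarts++) {
            Iterator<Event> events = source.watch(resourceVersion);
            boolean sawNullObject = false;
            while (events.hasNext()) {
                Event event = events.next();
                if (event.object == null) {
                    // Resuming from the same resource version would hit the same null again,
                    // so forget it and start over with a fresh watch.
                    System.err.println("WARN: watch returned a null object, restarting watch");
                    resourceVersion = null;
                    sawNullObject = true;
                    break;
                }
                processed.add(event.type + ":" + event.object);
                resourceVersion = event.resourceVersion;
            }
            if (!sawNullObject) {
                return processed; // the stream ended normally
            }
        }
        return processed; // gave up after maxRestarts restarts
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Fake watch: the first attempt fails with a null object, the second one succeeds.
        WatchSource source = resourceVersion -> {
            calls[0]++;
            if (calls[0] == 1) {
                return Arrays.asList(
                    new Event("ADDED", "node-a", "1"),
                    new Event("ERROR", null, null)).iterator();
            }
            return Arrays.asList(
                new Event("ADDED", "node-a", "5"),
                new Event("ADDED", "node-b", "6")).iterator();
        };
        System.out.println(consume(source, 3));
    }
}
```

The `maxRestarts` bound is an assumption added for the sketch so that a persistently failing watch cannot spin forever. A real provider would more likely back off and keep retrying, and would rebuild its node list from the fresh listing after each restart.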
