klarose opened a new pull request #12233:
URL: https://github.com/apache/druid/pull/12233


   
   
   Fixes #11520.
   
   
   
   ### Description
   
   Kubernetes watches allow a client to efficiently process changes to
   resources. However, they have some idiosyncrasies. In particular, they
   can error out for various reasons, producing what would normally be seen
   as an invalid result.
   
   The Druid Kubernetes node discovery subsystem does not handle one such
   case properly: the watch can return an item with a null object. This
   leads to a null pointer exception. When this happens, the provider needs
   to restart the watch, because rerunning the watch from the same resource
   version leads to the same result: yet another null pointer exception.
   
   This commit changes the provider to handle null objects by restarting
   the watch.
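
To illustrate the idea (this is not the code in this PR), here is a minimal, self-contained sketch of the restart-on-null pattern. `WatchSource`, `Event`, `FakeSource`, and `consume` are hypothetical names invented for this example; they are not part of Druid or the Kubernetes Java client, and the fake source only simulates a watch that keeps returning a null object for a stale resource version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class WatchRestartSketch {
    /** One watch notification. `object` is null when the watch reports an error. */
    static final class Event {
        final String type;
        final String object;           // stand-in for the real resource payload
        final String resourceVersion;

        Event(String type, String object, String resourceVersion) {
            this.type = type;
            this.object = object;
            this.resourceVersion = resourceVersion;
        }
    }

    /** Hypothetical abstraction over "list" and "watch" calls. */
    interface WatchSource {
        String listResourceVersion();                  // fresh list, returns latest version
        Iterator<Event> watch(String resourceVersion); // events after the given version
    }

    /** Consumes events, restarting from a fresh list whenever a null object is seen. */
    static List<String> consume(WatchSource source, int maxRestarts) {
        List<String> seen = new ArrayList<>();
        String version = source.listResourceVersion();
        int restarts = 0;
        while (restarts <= maxRestarts) {
            Iterator<Event> events = source.watch(version);
            boolean restart = false;
            while (events.hasNext()) {
                Event e = events.next();
                if (e.object == null) {
                    // Resuming from the same version would yield the same null
                    // again, so abandon this watch and start over.
                    restart = true;
                    break;
                }
                seen.add(e.type + ":" + e.object);
                version = e.resourceVersion;
            }
            if (!restart) {
                return seen; // watch ended without an error
            }
            restarts++;
            version = source.listResourceVersion();
        }
        return seen;
    }

    /** Simulated source: any version other than "20" ends in a null object. */
    static final class FakeSource implements WatchSource {
        private int lists = 0;

        @Override
        public String listResourceVersion() {
            lists++;
            return lists == 1 ? "10" : "20";
        }

        @Override
        public Iterator<Event> watch(String resourceVersion) {
            if ("20".equals(resourceVersion)) {
                return Arrays.asList(new Event("ADDED", "pod-b", "21")).iterator();
            }
            return Arrays.asList(
                new Event("ADDED", "pod-a", "11"),
                new Event("ERROR", null, null)).iterator();
        }
    }

    public static void main(String[] args) {
        // prints [ADDED:pod-a, ADDED:pod-b]
        System.out.println(consume(new FakeSource(), 3));
    }
}
```

The `maxRestarts` bound is an assumption of this sketch, added so the example always terminates; a real provider would more likely keep watching until it is stopped.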
   
   A cleaner alternative would be to change the provider to use an 
[Informer](https://github.com/kubernetes-client/java/blob/master/examples/examples-release-14/src/main/java/io/kubernetes/client/examples/InformerExample.java).
 I suspect this would simplify the code substantially while handling most, if 
not all, of the corner cases we could run into by using a bare watch. I don't 
have the time to undertake a change that large, though, so I'm 
submitting this quick fix so that we can at least resolve the most common issue 
that seems to affect the Kubernetes provider.
   
   
   <hr>
   
   ##### Key changed/added classes in this PR
   * DefaultK8sApiClient: now propagates the null object and logs a warning 
when this happens.
   * K8sDruidNodeDiscoveryProvider: handles the null object from the watch by 
restarting the watch, and logs a warning.
   <hr>
   
   
   This PR has:
   - [x] been self-reviewed.
   - [x] added comments explaining the "why" and the intent of the code 
wherever would not be obvious for an unfamiliar reader.
   - [x] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   - [x] been tested in a test Druid cluster.
   
   
   Note on testing: I didn't add unit tests for DefaultK8sApiClient. The 
infrastructure to do so is not present, unfortunately, and I suspect adding it 
would be a large undertaking. To test in a cluster, I reproduced the issue 
using [microk8s](https://microk8s.io/). I then repeated the same steps with my 
fix applied and confirmed that the error message no longer occurred in a tight 
loop and that discovery still worked (I restarted a pod, and the new one was 
discovered).
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
