[GitHub] [druid] klarose opened a new pull request #12233: kubernetes: restart watch on null response

GitBox Fri, 04 Feb 2022 08:38:06 -0800


klarose opened a new pull request #12233:
URL: https://github.com/apache/druid/pull/12233

Fixes #11520.

### Description

Kubernetes watches allow a client to efficiently process changes to
resources. However, they have some idiosyncrasies. In particular, they
can error out for various reasons, producing what would normally be seen
as an invalid result.

The Druid kubernetes node discovery subsystem does not handle one such
case properly. The watch can return an item with a null object. This
leads to a null pointer exception. When this happens, the provider needs
to restart the watch, because rerunning the watch from the same resource
version leads to the same result: yet another null pointer exception.

This commit changes the provider to handle null objects by restarting
the watch.
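
As a rough illustration of the idea (not the code in this PR), the sketch below simulates a watch whose stream can contain an item with a null object. All names here (`WatchRestartSketch`, `Event`, `WatchSource`, `consume`, `fakeSource`) are hypothetical stand-ins, not Druid or kubernetes-client classes. The consumer stops reading when it sees a null object, performs a fresh list to get a new resource version, and opens a new watch from there instead of retrying from the old version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class WatchRestartSketch {
    /** Hypothetical watch event; a null object marks an unusable result. */
    static final class Event {
        final String object;
        final String resourceVersion;

        Event(String object, String resourceVersion) {
            this.object = object;
            this.resourceVersion = resourceVersion;
        }
    }

    /** Hypothetical stand-in for the API client used by the discovery provider. */
    interface WatchSource {
        /** Performs a fresh list and returns the resource version to watch from. */
        String listResourceVersion();

        /** Opens a watch starting at the given resource version. */
        Iterator<Event> watch(String resourceVersion);
    }

    /** Consumes events, restarting the watch from a fresh list on a null object. */
    static List<String> consume(WatchSource source, int maxRestarts) {
        List<String> seen = new ArrayList<>();
        String version = source.listResourceVersion();
        int restarts = 0;
        while (true) {
            Iterator<Event> events = source.watch(version);
            boolean sawNull = false;
            while (events.hasNext()) {
                Event event = events.next();
                if (event.object == null) {
                    // Retrying from the same version would hit the same null again.
                    sawNull = true;
                    break;
                }
                seen.add(event.object);
                version = event.resourceVersion;
            }
            if (!sawNull || ++restarts > maxRestarts) {
                return seen; // the sketch stops here; a real provider keeps watching
            }
            version = source.listResourceVersion(); // restart from a fresh list
        }
    }

    /** Fake source: only a watch from the second list's version succeeds. */
    static WatchSource fakeSource() {
        return new WatchSource() {
            private int lists = 0;

            @Override
            public String listResourceVersion() {
                lists++;
                return "v" + lists;
            }

            @Override
            public Iterator<Event> watch(String resourceVersion) {
                if ("v2".equals(resourceVersion)) {
                    return Arrays.asList(new Event("pod-b", "v3")).iterator();
                }
                return Arrays.asList(new Event("pod-a", "v1a"), new Event(null, null)).iterator();
            }
        };
    }

    public static void main(String[] args) {
        System.out.println(consume(fakeSource(), 3));
    }
}
```

The restart cap (`maxRestarts`) exists only so the sketch terminates; it is an assumption of this example and not something the PR describes.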

A cleaner alternative to this would be to change the provider to use an
[Informer](https://github.com/kubernetes-client/java/blob/master/examples/examples-release-14/src/main/java/io/kubernetes/client/examples/InformerExample.java).
I suspect this would simplify the code substantially while handling most if
not all of the corner cases we could run into by using a bare watch. I don't
quite have the time to undertake a large change like that, though, so I'm
submitting this quick fix so that we can at least resolve the most common issue
that seems to affect the kubernetes provider.

<hr>

##### Key changed/added classes in this PR
* DefaultK8sApiClient: now propagates the null object and logs a warning
when this happens.
* K8sDruidNodeDiscoveryProvider: handles the null from the watch by
restarting it, and logs a warning.
<hr>

This PR has:
- [x] been self-reviewed.
- [x] added comments explaining the "why" and the intent of the code
wherever would not be obvious for an unfamiliar reader.
- [x] added unit tests or modified existing tests to cover new code paths,
ensuring the threshold for [code
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
is met.
- [x] been tested in a test Druid cluster.

Note on testing: I didn't add unit tests to DefaultK8sApiClient. The
infrastructure to do so was not present, unfortunately, and I suspect it'd be a
large undertaking. In terms of testing in a cluster, I reproduced the issue
using [microk8s](https://microk8s.io/). I then reran the same scenario with my
fix, which showed that the error message no longer occurred in a tight loop and
that discovery still worked (I restarted a pod, and the new one was discovered).
