klarose opened a new issue #11520:
URL: https://github.com/apache/druid/issues/11520
### Affected Version
0.21.1
### Description
We're running Druid in Kubernetes with 2 replicas of each component (middle
manager, broker, coordinator, historical, router). Our primary data source is
Kafka ingestion. Our cluster's nodes fail fairly regularly, since it runs on
GKE preemptible nodes.
Sometime overnight we started noticing stack traces coming from many of
the pods. E.g.
```
[historical-0 druid]
{"timeMillis":1627650366804,"thread":"org.apache.druid.k8s.discovery.K8sDruidNodeDiscoveryProvider$NodeRoleWatchercoordinator","level":"ERROR","loggerName":"org.apache.druid.k8s.discovery.K8sDruidNodeDiscoveryProvider$NodeRoleWatcher","message":"Error while watching node type [COORDINATOR]","thrown":{"commonElementCount":0,"name":"java.lang.NullPointerException","extendedStackTrace":[{"class":"org.apache.druid.k8s.discovery.DefaultK8sApiClient$2","method":"hasNext","file":"DefaultK8sApiClient.java","line":138,"exact":false,"location":"?","version":"?"},{"class":"org.apache.druid.k8s.discovery.K8sDruidNodeDiscoveryProvider$NodeRoleWatcher","method":"keepWatching","file":"K8sDruidNodeDiscoveryProvider.java","line":268,"exact":false,"location":"?","version":"?"},{"class":"org.apache.druid.k8s.discovery.K8sDruidNodeDiscoveryProvider$NodeRoleWatcher","method":"watch","file":"K8sDruidNodeDiscoveryProvider.java","line":237,"exact":false,"location":"?","version":"?"},{"class":"java.util.concurrent.Executors$RunnableAdapter","method":"call","file":"Executors.java","line":515,"exact":true,"location":"?","version":"?"},{"class":"java.util.concurrent.FutureTask","method":"run","file":"FutureTask.java","line":264,"exact":true,"location":"?","version":"?"},{"class":"java.util.concurrent.ThreadPoolExecutor","method":"runWorker","file":"ThreadPoolExecutor.java","line":1128,"exact":true,"location":"?","version":"?"},{"class":"java.util.concurrent.ThreadPoolExecutor$Worker","method":"run","file":"ThreadPoolExecutor.java","line":628,"exact":true,"location":"?","version":"?"},{"class":"java.lang.Thread","method":"run","file":"Thread.java","line":829,"exact":true,"location":"?","version":"?"}]},"endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":57,"threadPriority":5}
```
```
[middle-manager-1 druid]
{"timeMillis":1627650372166,"thread":"org.apache.druid.k8s.discovery.K8sDruidNodeDiscoveryProvider$NodeRoleWatcheroverlord","level":"ERROR","loggerName":"org.apache.druid.k8s.discovery.K8sDruidNodeDiscoveryProvider$NodeRoleWatcher","message":"Error while watching node type [OVERLORD]","thrown":{"commonElementCount":0,"name":"java.lang.NullPointerException"},"endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":103,"threadPriority":5}
```
I've attached the backtrace isolated from the earlier log message as
backtrace.txt
[backtrace.txt](https://github.com/apache/druid/files/6907834/backtrace.txt)
I took a quick look at the line at fault:
https://github.com/apache/druid/blob/8296123d895db7d06bc4517db5e767afb7862b83/extensions-core/kubernetes-extensions/src/main/java/org/apache/druid/k8s/discovery/DefaultK8sApiClient.java#L138
One possibility is that `object` is null. It looks like it can be, judging
from the Java Kubernetes client:
https://github.com/kubernetes-client/java/blob/f20788272291c0e79a8c831d8d5a7dd94d96d2de/util/src/main/java/io/kubernetes/client/util/Watch.java#L63
I don't quite understand what could lead to that -- perhaps an object being
deleted? A blip in k8s itself? An event that is unrelated to an object yet is
in the watch stream?
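For what it's worth, the fix on the iteration side could be as simple as skipping responses whose `object` is null instead of dereferencing them. Below is a simplified standalone sketch of that idea -- `WatchResponse` is a hypothetical stand-in for `io.kubernetes.client.util.Watch.Response`, not Druid's actual code:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for io.kubernetes.client.util.Watch.Response:
// some events (e.g. ERROR) may deliver a status payload, leaving `object` null.
class WatchResponse {
    final String type;   // ADDED, MODIFIED, DELETED, ERROR, ...
    final String object; // null when no object accompanies the event

    WatchResponse(String type, String object) {
        this.type = type;
        this.object = object;
    }
}

public class NullSafeWatch {
    // Skip events with no object rather than hitting an NPE on them.
    static int processEvents(List<WatchResponse> events) {
        int handled = 0;
        for (WatchResponse r : events) {
            if (r.object == null) {
                System.out.println("Skipping " + r.type + " event with null object");
                continue;
            }
            handled++; // real code would update the node-role cache here
        }
        return handled;
    }

    public static void main(String[] args) {
        List<WatchResponse> events = Arrays.asList(
            new WatchResponse("ADDED", "pod-a"),
            new WatchResponse("ERROR", null),     // e.g. a status-only event
            new WatchResponse("MODIFIED", "pod-a")
        );
        System.out.println("handled=" + processEvents(events)); // prints "handled=2"
    }
}
```

Whether skipping is the right recovery (versus logging and re-listing) depends on what kind of event actually produces the null, which I haven't been able to pin down.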
Restarting the pods fixes the problem, so my guess is that there's an event
in the watch stream *after* the most recently fetched resourceVersion which
triggers the problem. When we do hit the issue, the watcher restarts from the
previously successful resourceVersion, so it keeps replaying the same bad
event until we restart the pods and begin again from the current state of
things.
Either way, I think that the druid k8s client should handle this case.
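If the stale-resourceVersion theory is right, one possible handling strategy is to fall back to a fresh list (yielding the latest resourceVersion) when resuming the watch fails, rather than retrying forever from the cached version. A toy standalone simulation of that idea (none of these names are Druid's, and the failure mode here is my assumption):

```java
public class WatchRestart {
    // The resourceVersion a fresh LIST would currently return.
    static String latestVersion = "105";

    // Simulated watch: resuming from a stale resourceVersion keeps failing.
    static void watchFrom(String resourceVersion) {
        if (!resourceVersion.equals(latestVersion)) {
            throw new RuntimeException(
                "resourceVersion " + resourceVersion + " is too old");
        }
    }

    // On failure, re-list to obtain the latest resourceVersion and retry,
    // instead of looping on the previously cached one.
    static String watchWithRelist(String cachedVersion) {
        try {
            watchFrom(cachedVersion);
            return cachedVersion;
        } catch (RuntimeException e) {
            String fresh = latestVersion; // stands in for a full LIST call
            watchFrom(fresh);
            return fresh;
        }
    }

    public static void main(String[] args) {
        // prints "resumed at resourceVersion 105"
        System.out.println("resumed at resourceVersion " + watchWithRelist("100"));
    }
}
```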
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]