klarose opened a new issue #11520:
URL: https://github.com/apache/druid/issues/11520


   ### Affected Version
   
   0.21.1
   
   ### Description
   
   We're running Druid in Kubernetes with 2 replicas of each component (middle manager, broker, coordinator, historical, router). Our primary data source is Kafka ingestion. Our cluster's nodes fail fairly regularly, since it runs on GKE preemptible nodes.
   
   Sometime overnight we started noticing stack traces coming from many of the pods. E.g.
   
   ```
   [historical-0 druid] {"timeMillis":1627650366804,"thread":"org.apache.druid.k8s.discovery.K8sDruidNodeDiscoveryProvider$NodeRoleWatchercoordinator","level":"ERROR","loggerName":"org.apache.druid.k8s.discovery.K8sDruidNodeDiscoveryProvider$NodeRoleWatcher","message":"Error while watching node type [COORDINATOR]","thrown":{"commonElementCount":0,"name":"java.lang.NullPointerException","extendedStackTrace":[{"class":"org.apache.druid.k8s.discovery.DefaultK8sApiClient$2","method":"hasNext","file":"DefaultK8sApiClient.java","line":138,"exact":false,"location":"?","version":"?"},{"class":"org.apache.druid.k8s.discovery.K8sDruidNodeDiscoveryProvider$NodeRoleWatcher","method":"keepWatching","file":"K8sDruidNodeDiscoveryProvider.java","line":268,"exact":false,"location":"?","version":"?"},{"class":"org.apache.druid.k8s.discovery.K8sDruidNodeDiscoveryProvider$NodeRoleWatcher","method":"watch","file":"K8sDruidNodeDiscoveryProvider.java","line":237,"exact":false,"location":"?","version":"?"},{"class":"java.util.concurrent.Executors$RunnableAdapter","method":"call","file":"Executors.java","line":515,"exact":true,"location":"?","version":"?"},{"class":"java.util.concurrent.FutureTask","method":"run","file":"FutureTask.java","line":264,"exact":true,"location":"?","version":"?"},{"class":"java.util.concurrent.ThreadPoolExecutor","method":"runWorker","file":"ThreadPoolExecutor.java","line":1128,"exact":true,"location":"?","version":"?"},{"class":"java.util.concurrent.ThreadPoolExecutor$Worker","method":"run","file":"ThreadPoolExecutor.java","line":628,"exact":true,"location":"?","version":"?"},{"class":"java.lang.Thread","method":"run","file":"Thread.java","line":829,"exact":true,"location":"?","version":"?"}]},"endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":57,"threadPriority":5}
   
   ```
   
   ```
   [middle-manager-1 druid] {"timeMillis":1627650372166,"thread":"org.apache.druid.k8s.discovery.K8sDruidNodeDiscoveryProvider$NodeRoleWatcheroverlord","level":"ERROR","loggerName":"org.apache.druid.k8s.discovery.K8sDruidNodeDiscoveryProvider$NodeRoleWatcher","message":"Error while watching node type [OVERLORD]","thrown":{"commonElementCount":0,"name":"java.lang.NullPointerException"},"endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":103,"threadPriority":5}
   
   ```
   
   I've attached the backtrace, isolated from the earlier log message, as backtrace.txt.
   
   [backtrace.txt](https://github.com/apache/druid/files/6907834/backtrace.txt)
   
   I took a quick look at the line at fault: 
https://github.com/apache/druid/blob/8296123d895db7d06bc4517db5e767afb7862b83/extensions-core/kubernetes-extensions/src/main/java/org/apache/druid/k8s/discovery/DefaultK8sApiClient.java#L138
   
   One possibility is that `object` is null. It looks like it can be, from looking at the Java Kubernetes client: 
https://github.com/kubernetes-client/java/blob/f20788272291c0e79a8c831d8d5a7dd94d96d2de/util/src/main/java/io/kubernetes/client/util/Watch.java#L63
   
   I don't quite understand what could lead to that -- perhaps an object being deleted? A blip in k8s itself? An event that is unrelated to an object, yet is in the watch stream?
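   If `object` can indeed be null, the simplest fix may be a defensive guard in the event loop. Below is a minimal, self-contained sketch of that pattern; `Response` here is a hypothetical stand-in for `io.kubernetes.client.util.Watch.Response<T>` (which exposes public `type` and `object` fields), not Druid's actual code, and `processEvents` is an illustrative name.

   ```java
   import java.util.List;

   public class WatchGuardSketch {
       // Stand-in for Watch.Response<T> from the Java Kubernetes client,
       // which exposes public fields `type` and `object`.
       public static class Response<T> {
           public final String type;
           public final T object;
           public Response(String type, T object) { this.type = type; this.object = object; }
       }

       // Iterate watch events, skipping any whose object failed to deserialize
       // instead of letting the consumer hit a NullPointerException.
       public static <T> int processEvents(List<Response<T>> events) {
           int handled = 0;
           for (Response<T> item : events) {
               if (item.object == null) {
                   // e.g. an ERROR event carries a status payload rather than the
                   // watched type, so `object` can come back null after parsing.
                   System.err.println("Skipping watch event [" + item.type + "] with null object");
                   continue;
               }
               handled++; // real code would update the discovered-node list here
           }
           return handled;
       }

       public static void main(String[] args) {
           List<Response<String>> events = List.of(
               new Response<>("ADDED", "pod-a"),
               new Response<>("ERROR", null),
               new Response<>("MODIFIED", "pod-a"));
           System.out.println(processEvents(events)); // prints 2
       }
   }
   ```

   The point is only that a null `object` should be logged and skipped (or trigger a clean watch restart), rather than propagate as an NPE out of `hasNext`.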
   
   It looks like restarting the pods fixes the problem, so I'm guessing there's an event in the watch queue *after* the most recently fetched resourceVersion which triggers the problem. When we hit the issue, the watcher just restarts from the previously successful resourceVersion, so it keeps replaying the bad event until we restart the pod and start again from the most recent state of things.
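   If that theory is right, recovery could fall back to a fresh list (getting a new resourceVersion) after a watch failure, rather than resuming from the resourceVersion that keeps replaying the poison event. A minimal sketch of that retry shape, with a hypothetical `Api` interface standing in for the real discovery client:

   ```java
   public class WatchRestartSketch {
       // Hypothetical minimal view of the k8s API; the real logic would live in
       // K8sDruidNodeDiscoveryProvider / DefaultK8sApiClient.
       public interface Api {
           String list();                                   // full re-list; returns the latest resourceVersion
           void watchFrom(String resourceVersion) throws Exception;
       }

       // On watch failure, re-list to obtain a fresh resourceVersion instead of
       // resuming from the one that keeps replaying the bad event.
       public static String keepWatching(Api api, String resourceVersion, int maxRetries) {
           for (int attempt = 0; attempt < maxRetries; attempt++) {
               try {
                   api.watchFrom(resourceVersion);
                   return resourceVersion; // watch ended cleanly
               } catch (Exception e) {
                   resourceVersion = api.list(); // jump past the poison event
               }
           }
           return resourceVersion;
       }

       public static void main(String[] args) {
           // Fake API: watching from "100" always fails; a re-list returns "200",
           // from which the watch succeeds.
           Api api = new Api() {
               public String list() { return "200"; }
               public void watchFrom(String rv) throws Exception {
                   if (rv.equals("100")) { throw new Exception("poison event"); }
               }
           };
           System.out.println(keepWatching(api, "100", 3)); // prints 200
       }
   }
   ```

   This is the standard list-then-watch recovery pattern; it matches the observation that a pod restart (which necessarily re-lists) clears the problem.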
   
   Either way, I think the Druid k8s discovery client should handle this case.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
