wuYin edited a comment on issue #9297:
URL: https://github.com/apache/pulsar/issues/9297#issuecomment-767280833
@congbobo184 Thanks for the review. I'm using pulsar-helm-chart to deploy the cluster. In proxy.conf, the broker connection addresses look like:

```
brokerServiceURL=pulsar://handshake-pulsar-broker:6650
brokerWebServiceURL=http://handshake-pulsar-broker:8080
```

which are generated by [proxy-configmap.yaml#L37](https://github.com/apache/pulsar-helm-chart/blob/master/charts/pulsar/templates/proxy-configmap.yaml#L37). In the proxy pod:

```
> cat /etc/resolv.conf
search psr.svc.cluster.local svc.cluster.local cluster.local

> host handshake-pulsar-broker
handshake-pulsar-broker.psr.svc.cluster.local has address 10.113.42.32
handshake-pulsar-broker.psr.svc.cluster.local has address 10.113.43.53   # will be removed
handshake-pulsar-broker.psr.svc.cluster.local has address 10.113.46.57

> host handshake-pulsar-broker-1.handshake-pulsar-broker.psr.svc.cluster.local   # bundle owner host
handshake-pulsar-broker-1.handshake-pulsar-broker.psr.svc.cluster.local has address 10.113.43.53
```

For this issue: while broker1 is restarting/terminating, its service DNS record is removed quickly (within 1s). The proxy then asks the other brokers to do the lookup, but because broker1's zNode has not expired yet, they still return `broker1.xxx.cluster.local`, which no longer resolves, so the client backs off and retries the same lookup.

I think that behavior is reasonable, but there is still a small chance of hitting a flaky case: in my production environment, I drained a k8s node, which caused a broker to be rescheduled onto another node, and the client kept retrying the lookup for 16 minutes and still failed.
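To make the window easier to observe, here is a minimal bash sketch that polls, once per second during a broker restart, both the broker list registered in the metadata store and the pod's DNS record. The cluster name, hostnames, and the availability of `pulsar-admin` and `host` inside the proxy pod are assumptions based on the deployment described above, not part of the original report:

```
#!/usr/bin/env bash
# Sketch: watch for the window where broker1's zNode is still registered
# (so lookups can still return it) while its per-pod DNS record is gone.
# Assumptions: hostnames below, a cluster named "handshake-pulsar", and
# pulsar-admin + host available and configured inside the proxy pod.

BROKER_FQDN="handshake-pulsar-broker-1.handshake-pulsar-broker.psr.svc.cluster.local"
CLUSTER="handshake-pulsar"

while true; do
  ts=$(date +%H:%M:%S)

  # Brokers still registered in ZooKeeper, as reported by the admin API.
  registered=$(pulsar-admin brokers list "$CLUSTER" 2>/dev/null | grep -c "handshake-pulsar-broker-1")

  # Whether the headless-service DNS record for the pod still resolves.
  if host "$BROKER_FQDN" >/dev/null 2>&1; then dns="resolvable"; else dns="gone"; fi

  echo "$ts znode-registered=$registered dns=$dns"
  sleep 1
done
```

While the race described above is happening, the output should show `znode-registered=1 dns=gone`, which is exactly the window in which the other brokers keep returning the stale address and the client keeps backing off and retrying.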
