[ https://issues.apache.org/jira/browse/KAFKA-6520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17115097#comment-17115097 ]
Vince Mu commented on KAFKA-6520:
---------------------------------

[~mjsax] your explanation about coordinator metadata and liveness makes perfect sense. Thanks for that.

Regarding the implementation of a disconnected timeout, I'm not sure that introducing a disconnected timeout and measuring the timeout of each fetch request is necessary. It seems that the ConsumerNetworkClient and NetworkClient will already fail an unsent or transmitted fetch request with a disconnected exception if its connection to a node dies. So instead of throwing a disconnected exception when all fetch requests time out, we could throw one when all the fetch requests fail with a disconnected exception. I feel this might be a simpler solution that uses what is already there.

Thoughts on this? Please feel free to correct me. I'm still learning the code base bit by bit.

> When a Kafka Stream can't communicate with the server, it's Status stays RUNNING
> --------------------------------------------------------------------------------
>
>                 Key: KAFKA-6520
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6520
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: Michael Kohout
>            Priority: Major
>              Labels: newbie, user-experience
>
> KIP WIP:
> [https://cwiki.apache.org/confluence/display/KAFKA/KIP-457%3A+Add+DISCONNECTED+status+to+Kafka+Streams]
>
> When you execute the following scenario, the application is always in the
> RUNNING state:
>
> 1) start kafka
> 2) start app; app connects to kafka and starts processing
> 3) kill kafka (stop docker container)
> 4) the application doesn't give any indication that it's no longer
> connected (Stream State is still RUNNING, and the uncaught exception handler
> isn't invoked)
>
> It would be useful if the Stream State had a DISCONNECTED status.
>
> See
> [this|https://groups.google.com/forum/#!topic/confluent-platform/nQh2ohgdrIQ]
> for a discussion from the google user forum. This is a link to a related
> issue.
> -------------------------
> Update: there are some discussions on the PR itself which lead me to think
> that a more general solution should be at the ClusterConnectionStates level
> rather than at the Streams or even Consumer level. One proposal would be:
> * Add a new metric named `failedConnection` in SelectorMetrics, recorded in
> the `connect()` and `pollSelectionKeys()` functions upon catching the
> IOException / RuntimeException that indicates the connection was lost.
> * Users of Consumer / Streams can then monitor this metric. It will normally
> stay close to zero, since disconnects are transient; if it spikes, the
> brokers are consistently unavailable, which indicates the disconnected state.
> [~Yohan123] WDYT?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
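
For illustration only, the check suggested in the comment above (report a disconnect only when every outstanding fetch failed with a disconnected exception) could be sketched roughly as follows. This is not Kafka code: `DisconnectSketch`, `FetchOutcome`, and `allFetchesDisconnected` are hypothetical names made up for the sketch, and how the real ConsumerNetworkClient surfaces per-request failures is assumed rather than taken from the code base.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; none of these types exist in Kafka.
public class DisconnectSketch {

    // Assumed per-request result; in Kafka this would be derived from how
    // each fetch request future completed.
    public enum FetchOutcome { SUCCESS, DISCONNECTED, OTHER_FAILURE }

    // True only if at least one fetch was attempted and every one of them
    // failed with a disconnect. A single success or unrelated failure means
    // at least one broker was reachable, so no disconnect is reported.
    public static boolean allFetchesDisconnected(List<FetchOutcome> outcomes) {
        if (outcomes.isEmpty()) {
            return false; // nothing was sent, so nothing can be concluded
        }
        for (FetchOutcome outcome : outcomes) {
            if (outcome != FetchOutcome.DISCONNECTED) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<FetchOutcome> allDown =
            Arrays.asList(FetchOutcome.DISCONNECTED, FetchOutcome.DISCONNECTED);
        List<FetchOutcome> oneUp =
            Arrays.asList(FetchOutcome.DISCONNECTED, FetchOutcome.SUCCESS);

        System.out.println(allFetchesDisconnected(allDown));
        System.out.println(allFetchesDisconnected(oneUp));
    }
}
```

The empty-list case returns false on purpose, so that a poll round with no fetches in flight is not mistaken for a lost connection.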