[ https://issues.apache.org/jira/browse/KAFKA-6249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16268753#comment-16268753 ]
Charles Crain commented on KAFKA-6249:
--------------------------------------

It is possible I am seeing the delay due to the new instance. We use Kubernetes as our cluster manager, so if a node crashes, one is immediately brought back up to replace it. So our actual failure scenario is one node leaving, then another joining shortly afterward. During this time, the cluster spends a long time in the rebalancing state, during which some or all partitions of data are unavailable even when standby replicas are configured.

If KAFKA-6144 relates to querying standby replicas, and will also allow querying of stale data (i.e. data from replicas that are in the process of rebalancing but have not concluded yet), then yes, I think you can close this as a duplicate.

Ultimately our desired behavior is: as long as the number of offline nodes during a rebalance is never greater than the number of standby replicas, the data should always be present on at least one node, and therefore queries should continue to work.

> Interactive query downtime when node goes down even with standby replicas
> -------------------------------------------------------------------------
>
>                 Key: KAFKA-6249
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6249
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 1.0.0
>            Reporter: Charles Crain
>
> In a multi-node Kafka Streams application that uses interactive queries, the
> queryable store will become unavailable (throw InvalidStateStoreException)
> for up to several minutes when a node goes down. This happens regardless of
> how many nodes are in the application as well as how many standby replicas
> are configured.
> My expectation is that if a standby replica is present, that the interactive
> query would fail over to the live replica immediately causing negligible
> downtime for interactive queries.
> Instead, what appears to happen is that
> the queryable store is down for however long it takes for the nodes to
> completely rebalance (this takes a few minutes for a couple GB of total data
> in the queryable store's backing topic).
> I am filing this as a bug, realizing that it may in fact be a feature
> request. However, until there is a way we can use interactive queries with
> minimal (~zero) downtime on node failure, we are having to entertain other
> strategies for serving queries (e.g. manually materializing the topic to an
> external resilient store such as Cassandra) in order to meet our SLAs.
> If there is a way to minimize the downtime of interactive queries on node
> failure that I am missing, I would like to know what it is.
> Our team is super-enthusiastic about Kafka Streams and we're keen to use it
> for just about everything! This is our only major roadblock.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
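
The desired behavior stated in the comment (queries keep working as long as the number of offline nodes never exceeds the number of standby replicas) can be written down as a small availability check. The following is only a sketch of that expectation, not of what Kafka Streams 1.0.0 actually does, and it does not use any Kafka API. The class name `StandbyAvailability`, the methods `survivesWorstCase` and `hasLiveCopy`, and the node names are hypothetical. It assumes each partition's state is held by one active node plus one node per configured standby replica, all distinct.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class StandbyAvailability {

    /**
     * Worst case: every offline node happens to host a copy of the same partition.
     * With 1 active copy plus numStandbyReplicas standby copies (assumed to be on
     * distinct nodes), at least one copy survives iff offlineNodes <= numStandbyReplicas.
     */
    static boolean survivesWorstCase(int numStandbyReplicas, int offlineNodes) {
        int copies = 1 + numStandbyReplicas;
        return offlineNodes < copies;
    }

    /** True if at least one node hosting this partition's state is still online. */
    static boolean hasLiveCopy(Set<String> hostsOfPartition, Set<String> offlineNodes) {
        for (String host : hostsOfPartition) {
            if (!offlineNodes.contains(host)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical placement: one active copy and one standby copy.
        Set<String> hosts = new HashSet<>(Arrays.asList("node-a", "node-b"));
        Set<String> offline = new HashSet<>(Arrays.asList("node-a"));

        System.out.println(hasLiveCopy(hosts, offline));   // true: node-b still holds the data
        System.out.println(survivesWorstCase(1, 1));       // true: 1 offline <= 1 standby
        System.out.println(survivesWorstCase(1, 2));       // false: both copies may be gone
    }
}
```

Under this model, the scenario described in the comment (one node leaving, a replacement joining shortly after, one standby replica configured) always has a live copy of the data, which is why the reporter expects queries to remain servable during the rebalance.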