Will Berkeley created KUDU-2589:
-----------------------------------

             Summary: ITClient is flaky under stress when leadership changes
                 Key: KUDU-2589
                 URL: https://issues.apache.org/jira/browse/KUDU-2589
             Project: Kudu
          Issue Type: Bug
    Affects Versions: 1.8.0
            Reporter: Will Berkeley
Saw this failure in the ITClient test:
{noformat}
00:34:51.362 [DEBUG - New I/O worker #10] (AsyncKuduScanner.java:492) Can not open scanner
org.apache.kudu.client.NonRecoverableException: Tablet hasn't heard from leader, or there hasn't been a stable leader for: 0.757s secs, (max is 0.750s):
	at org.apache.kudu.client.RpcProxy.dispatchTSError(RpcProxy.java:320)
	at org.apache.kudu.client.RpcProxy.responseReceived(RpcProxy.java:242)
	at org.apache.kudu.client.RpcProxy.access$000(RpcProxy.java:59)
	at org.apache.kudu.client.RpcProxy$1.call(RpcProxy.java:131)
	at org.apache.kudu.client.RpcProxy$1.call(RpcProxy.java:127)
	at org.apache.kudu.client.Connection.messageReceived(Connection.java:391)
	at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
	at org.apache.kudu.client.Connection.handleUpstream(Connection.java:243)
	at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
	at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
	at org.jboss.netty.handler.timeout.ReadTimeoutHandler.messageReceived(ReadTimeoutHandler.java:184)
	at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
	at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
	at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
	at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
	at org.jboss.netty.handler.codec.oneone.OneToOneDecoder.handleUpstream(OneToOneDecoder.java:70)
	at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
	at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
	at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
	at org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
	at org.jboss.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
	at org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
	at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
	at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
	at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
	at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
	at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
	at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
	at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
	at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
	at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
	at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
	at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
	at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
{noformat}
A new leader had been elected just before:
{noformat}
00:34:50.953 [INFO - cluster stderr printer] (MiniKuduCluster.java:526) I0925 00:34:50.953722 18257 catalog_manager.cc:3758] T 07e99535cba24d8e991829485de22275 P 060cf49269fb4f6f901696d741e69303 reported cstate change: term changed from 1 to 2, leader changed from acaedc4ec505489dbc853d8b32bfc147 (127.16.196.2) to 060cf49269fb4f6f901696d741e69303 (127.16.196.3).
New cstate: current_term: 2 leader_uuid: "060cf49269fb4f6f901696d741e69303" committed_config {
  opid_index: -1
  OBSOLETE_local: false
  peers {
    permanent_uuid: "acaedc4ec505489dbc853d8b32bfc147"
    member_type: VOTER
    last_known_addr { host: "127.16.196.2" port: 58760 }
    health_report { overall_health: UNKNOWN }
  }
  peers {
    permanent_uuid: "060cf49269fb4f6f901696d741e69303"
    member_type: VOTER
    last_known_addr { host: "127.16.196.3" port: 37181 }
    health_report { overall_health: HEALTHY }
  }
  peers {
    permanent_uuid: "2f0ee942338c4fb7bce908b1c32f0bcb"
    member_type: VOTER
    last_known_addr { host: "127.16.196.1" port: 47865 }
    health_report { overall_health: UNKNOWN }
  }
}
{noformat}
It appears the 0.75-second window during which the async scanner will wait for a stable leader partially overlapped with the election, and the portion of the window remaining after the election completed wasn't long enough to learn who the new leader was.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
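One way to make the test robust to this race is to retry the scanner open with backoff until an overall deadline that comfortably exceeds one election, rather than relying on the server-side 0.75s leader wait. The sketch below is purely illustrative and not the ITClient code: the helper name {{withRetries}} and the simulated failure are hypothetical, and a real fix would catch the Kudu client's exception type instead of a generic one.

```java
import java.util.concurrent.Callable;

public class RetryUntilLeader {
    // Retry an operation that can fail transiently while no stable leader
    // exists (e.g. opening a scanner during an election), until it succeeds
    // or an overall deadline expires.
    static <T> T withRetries(Callable<T> op, long deadlineMs) throws Exception {
        long start = System.currentTimeMillis();
        Exception last = null;
        while (System.currentTimeMillis() - start < deadlineMs) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;          // e.g. "Tablet hasn't heard from leader..."
                Thread.sleep(100); // back off before the next attempt
            }
        }
        throw last; // deadline exhausted; surface the last failure
    }

    public static void main(String[] args) throws Exception {
        // Fake operation: fails twice, simulating the leaderless window,
        // then succeeds once a new leader is known.
        final int[] attempts = {0};
        String result = withRetries(() -> {
            if (++attempts[0] < 3) {
                throw new RuntimeException("Tablet hasn't heard from leader");
            }
            return "scanner opened";
        }, 5000);
        System.out.println(result + " after " + attempts[0] + " attempts");
    }
}
```

With a 5-second deadline the retry loop rides out a leaderless window far longer than the 0.75s the scanner waits server-side, which is the kind of slack a stress test with induced leadership changes needs.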