[ https://issues.apache.org/jira/browse/KUDU-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16720732#comment-16720732 ]
Adar Dembo commented on KUDU-2387: ---------------------------------- I spent some time independently looking into this before realizing there was already a JIRA filed. Below are my notes, stored for posterity. When {{ConnectToCluster}} fails to find a leader master, it'll invoke its deferred's callback with an exception of type {{RecoverableException}}. As far as I can tell, whether or not the request is retried depends on what errbacks have been chained to that deferred. If we reached {{ConnectToCluster}} via {{sendRpcToTablet()}}, it ensures that a {{RetryRpcErrback}} is chained to the deferred, and when invoked, that errback looks at the exception type, and if it's recoverable, it schedules a retry of the RPC. So the important observation here is that invoking a deferred with a {{RecoverableException}} isn't enough for an RPC to be retried; we must ensure that your call chain includes {{RetryRpcErrback}} (usually done by calling {{sendRpcToTablet()}}). So what's going on? Well, {{exportAuthenticationCredentials()}} uses {{ConnectToCluster}} via {{getMasterTableLocationsPB()}}, but there's no wrapping in {{sendRpcToTablet()}}. So I don't see who on the call chain is going to retry the {{RecoverableException}}, and it makes sense that the exception just propagates up the stack and fails the application. BTW, {{getHiveMetastoreConfig()}} is equally vulnerable to this. Here's the log that helped me make sense of this. It's a failure in DefaultSourceTest with the cluster logging stripped away: {noformat} 20:09:07.846 [INFO - Test worker] (RandomUtils.java:46) Using random seed: 1543896547846 20:09:07.847 [INFO - Test worker] (KuduTestHarness.java:137) Creating a new MiniKuduCluster... 20:09:07.847 [INFO - Test worker] (MiniKuduCluster.java:146) Using the temp directory defined by TEST_TMPDIR: /data/somelongdirectorytoavoidrpathissues/src/kudutest 20:09:07.847 [INFO - Test worker] (KuduBinaryLocator.java:52) Using Kudu binary directory specified by system property 'kuduBinDir': /data/somelongdirectorytoavoidrpathissues/src/kudu/java/../build/latest/bin 20:09:07.847 [INFO - Test worker] (MiniKuduCluster.java:197) Starting process: [/data/somelongdirectorytoavoidrpathissues/src/kudu/java/../build/latest/bin/kudu, test, mini_cluster, --serialization=pb] 20:09:08.266 [INFO - Test worker] (KuduTestHarness.java:139) Creating a new Kudu client... 20:09:08.337 [WARN - New I/O worker #587] (ConnectToCluster.java:278) None of the provided masters 127.27.137.61:53849,127.27.137.60:37363,127.27.137.62:55005 is a leader; will retry 20:09:08.373 [ERROR - Test worker] (RetryRule.java:76) testUpsertRowsIgnoreNulls(org.apache.kudu.spark.kudu.DefaultSourceTest): failed run 1 java.security.PrivilegedActionException: org.apache.kudu.client.NoLeaderFoundException: Master config (127.27.137.61:53849,127.27.137.60:37363,127.27.137.62:55005) has no leader. at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:360) at org.apache.kudu.spark.kudu.KuduContext.<init>(KuduContext.scala:122) at org.apache.kudu.spark.kudu.KuduContext.<init>(KuduContext.scala:65) at org.apache.kudu.spark.kudu.KuduTestSuite$class.setUpBase(KuduTestSuite.scala:129) at org.apache.kudu.spark.kudu.DefaultSourceTest.setUpBase(DefaultSourceTest.scala:38) at sun.reflect.GeneratedMethodAccessor60.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) at org.apache.kudu.test.junit.RetryRule$RetryStatement.evaluate(RetryRule.java:72) at org.junit.rules.RunRules.evaluate(RunRules.java:20) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) at org.junit.runners.ParentRunner.run(ParentRunner.java:363) at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.runTestClass(JUnitTestClassExecutor.java:106) at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:58) at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:38) at org.gradle.api.internal.tasks.testing.junit.AbstractJUnitTestClassProcessor.processTestClass(AbstractJUnitTestClassProcessor.java:66) at org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.processTestClass(SuiteTestClassProcessor.java:51) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35) at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24) at org.gradle.internal.dispatch.ContextClassLoaderDispatch.dispatch(ContextClassLoaderDispatch.java:32) at org.gradle.internal.dispatch.ProxyDispatchAdapter$DispatchingInvocationHandler.invoke(ProxyDispatchAdapter.java:93) at com.sun.proxy.$Proxy2.processTestClass(Unknown Source) at org.gradle.api.internal.tasks.testing.worker.TestWorker.processTestClass(TestWorker.java:117) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35) at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24) at org.gradle.internal.remote.internal.hub.MessageHubBackedObjectConnection$DispatchWrapper.dispatch(MessageHubBackedObjectConnection.java:155) at org.gradle.internal.remote.internal.hub.MessageHubBackedObjectConnection$DispatchWrapper.dispatch(MessageHubBackedObjectConnection.java:137) at org.gradle.internal.remote.internal.hub.MessageHub$Handler.run(MessageHub.java:404) at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:63) at org.gradle.internal.concurrent.ManagedExecutorImpl$1.run(ManagedExecutorImpl.java:46) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at org.gradle.internal.concurrent.ThreadFactoryImpl$ManagedThreadRunnable.run(ThreadFactoryImpl.java:55) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.kudu.client.NoLeaderFoundException: Master config (127.27.137.61:53849,127.27.137.60:37363,127.27.137.62:55005) has no leader. at org.apache.kudu.client.KuduException.transformException(KuduException.java:110) at org.apache.kudu.client.KuduClient.joinAndHandleException(KuduClient.java:365) at org.apache.kudu.client.KuduClient.exportAuthenticationCredentials(KuduClient.java:316) at org.apache.kudu.spark.kudu.KuduContext$$anon$1.run(KuduContext.scala:124) at org.apache.kudu.spark.kudu.KuduContext$$anon$1.run(KuduContext.scala:122) ... 56 more Suppressed: org.apache.kudu.client.KuduException$OriginalException: Original asynchronous stack trace at org.apache.kudu.client.ConnectToCluster.incrementCountAndCheckExhausted(ConnectToCluster.java:279) at org.apache.kudu.client.ConnectToCluster.access$100(ConnectToCluster.java:47) at org.apache.kudu.client.ConnectToCluster$ConnectToMasterCB.call(ConnectToCluster.java:323) at org.apache.kudu.client.ConnectToCluster$ConnectToMasterCB.call(ConnectToCluster.java:312) at com.stumbleupon.async.Deferred.doCall(Deferred.java:1280) at com.stumbleupon.async.Deferred.runCallbacks(Deferred.java:1259) at com.stumbleupon.async.Deferred.callback(Deferred.java:1002) at org.apache.kudu.client.KuduRpc.handleCallback(KuduRpc.java:247) at org.apache.kudu.client.KuduRpc.callback(KuduRpc.java:294) at org.apache.kudu.client.RpcProxy.responseReceived(RpcProxy.java:269) at org.apache.kudu.client.RpcProxy.access$000(RpcProxy.java:59) at org.apache.kudu.client.RpcProxy$1.call(RpcProxy.java:131) at org.apache.kudu.client.RpcProxy$1.call(RpcProxy.java:127) at org.apache.kudu.client.Connection.messageReceived(Connection.java:391) at org.apache.kudu.shaded.org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70) at org.apache.kudu.client.Connection.handleUpstream(Connection.java:243) at org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564) at org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791) at org.apache.kudu.shaded.org.jboss.netty.handler.timeout.ReadTimeoutHandler.messageReceived(ReadTimeoutHandler.java:184) at org.apache.kudu.shaded.org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70) at org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564) at org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791) at org.apache.kudu.shaded.org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296) at org.apache.kudu.shaded.org.jboss.netty.handler.codec.oneone.OneToOneDecoder.handleUpstream(OneToOneDecoder.java:70) at org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564) at org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791) at org.apache.kudu.shaded.org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296) at org.apache.kudu.shaded.org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462) at org.apache.kudu.shaded.org.jboss.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443) at org.apache.kudu.shaded.org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303) at org.apache.kudu.shaded.org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70) at org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564) at org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559) at org.apache.kudu.shaded.org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268) at org.apache.kudu.shaded.org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255) at org.apache.kudu.shaded.org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88) at org.apache.kudu.shaded.org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108) at org.apache.kudu.shaded.org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337) at org.apache.kudu.shaded.org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89) at org.apache.kudu.shaded.org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178) at org.apache.kudu.shaded.org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) at org.apache.kudu.shaded.org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ... 1 more {noformat} > exportAuthenticationCredentials does not retry connectToCluster > --------------------------------------------------------------- > > Key: KUDU-2387 > URL: https://issues.apache.org/jira/browse/KUDU-2387 > Project: Kudu > Issue Type: Bug > Components: java > Affects Versions: 1.7.0 > Reporter: Todd Lipcon > Priority: Critical > > TestSecurity has the following TODO: > {code} > // TODO(todd): it seems that exportAuthenticationCredentials() doesn't > properly retry > // in the case that there is no leader, even though > NoLeaderFoundException is a RecoverableException. > // So, we have to use a hack of calling listTabletServers, which _does_ > properly retry, > // in order to wait for the masters to elect a leader. > {code} > It seems like this causes occasional failures of tests like KuduRDDTest -- I > saw a case where the client failed to connect due to a negotiation timeout, > and then didn't retry at all. It's not clear why the 3-second negotiation > timeout was insufficient in this test case but likely just machine load or > somesuch. -- This message was sent by Atlassian JIRA (v7.6.3#76005)