elukey opened a new issue #10209:
URL: https://github.com/apache/druid/issues/10209
Hi everybody,

I am currently testing Druid 0.19.0; I'd like to upgrade the two clusters
that we run in production from 0.12.3 to 0.19.0. So far all my tests look good
(really nice job!), but I noticed odd behavior when an indexing task kicks
off a map-reduce job that fails for some reason. I get the following logs
repeated over and over, and eventually the middle manager stops:
```
2020-07-24T06:44:34,700 INFO org.apache.druid.indexer.JobHelper: Deleting path[/tmp/druid-indexing/test_webrequest_new_hadoop/2020-07-24T064311.660Z_5d3b1f48fbc244be97c7dbe42c49bef5]
2020-07-24T06:44:34,999 INFO org.apache.druid.indexing.worker.executor.ExecutorLifecycle: Task completed with status: {
  "id" : "index_hadoop_test_webrequest_new_hadoop_cldagnim_2020-07-24T06:43:11.653Z",
  "status" : "FAILED",
  "duration" : 74867,
  "errorMsg" : "{\"attempt_1594733405064_2112_m_000009_1\":\"Error: org.apache.druid.java.util.common.RE: Failure on ro...",
  "location" : {
    "host" : null,
    "port" : -1,
    "tlsPort" : -1
  }
}
2020-07-24T06:44:35,008 INFO org.apache.druid.java.util.common.lifecycle.Lifecycle: Stopping lifecycle [module] stage [ANNOUNCEMENTS]
2020-07-24T06:44:35,011 INFO org.apache.druid.java.util.common.lifecycle.Lifecycle: Stopping lifecycle [module] stage [SERVER]
2020-07-24T06:44:35,016 INFO org.eclipse.jetty.server.AbstractConnector: Stopped ServerConnector@36f40d72{HTTP/1.1,[http/1.1]}{0.0.0.0:8200}
2020-07-24T06:44:35,016 INFO org.eclipse.jetty.server.session: node0 Stopped scavenging
2020-07-24T06:44:35,019 INFO org.eclipse.jetty.server.handler.ContextHandler: Stopped o.e.j.s.ServletContextHandler@3bc18fec{/,null,UNAVAILABLE}
2020-07-24T06:44:35,021 INFO org.apache.druid.java.util.common.lifecycle.Lifecycle: Stopping lifecycle [module] stage [NORMAL]
2020-07-24T06:44:35,022 INFO org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner: Starting graceful shutdown of task[index_hadoop_test_webrequest_new_hadoop_cldagnim_2020-07-24T06:43:11.653Z].
[..]
2020-07-24T06:44:39,457 WARN org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
2020-07-24T06:44:39,464 INFO org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider: Failing over to analytics1029-eqiad-wmnet
2020-07-24T06:44:39,466 WARN org.apache.hadoop.ipc.Client: Failed to connect to server: analytics1029.eqiad.wmnet/10.64.36.129:8032: retries get failed due to exceeded maximum allowed retries number: 0
java.net.ConnectException: Connection refused
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:1.8.0_252]
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:714) ~[?:1.8.0_252]
	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:685) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:788) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.ipc.Client$Connection.access$3500(Client.java:410) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1550) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.ipc.Client.call(Client.java:1381) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.ipc.Client.call(Client.java:1345) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116) ~[hadoop-common-2.8.5.jar:?]
	at com.sun.proxy.$Proxy318.getApplicationReport(Unknown Source) ~[?:?]
	at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationReport(ApplicationClientProtocolPBClientImpl.java:228) ~[hadoop-yarn-common-2.8.5.jar:?]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_252]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_252]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_252]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_252]
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:346) ~[hadoop-common-2.8.5.jar:?]
	at com.sun.proxy.$Proxy319.getApplicationReport(Unknown Source) ~[?:?]
	at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationReport(YarnClientImpl.java:480) ~[hadoop-yarn-client-2.8.5.jar:?]
	at org.apache.hadoop.mapred.ResourceMgrDelegate.getApplicationReport(ResourceMgrDelegate.java:314) ~[hadoop-mapreduce-client-jobclient-2.8.5.jar:?]
	at org.apache.hadoop.mapred.ClientServiceDelegate.getProxy(ClientServiceDelegate.java:155) ~[hadoop-mapreduce-client-jobclient-2.8.5.jar:?]
	at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:324) ~[hadoop-mapreduce-client-jobclient-2.8.5.jar:?]
	at org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:429) ~[hadoop-mapreduce-client-jobclient-2.8.5.jar:?]
	at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:617) ~[hadoop-mapreduce-client-jobclient-2.8.5.jar:?]
	at org.apache.hadoop.mapreduce.Cluster.getJob(Cluster.java:207) ~[hadoop-mapreduce-client-core-2.8.5.jar:?]
	at org.apache.hadoop.mapreduce.tools.CLI.getJob(CLI.java:547) ~[hadoop-mapreduce-client-core-2.8.5.jar:?]
	at org.apache.hadoop.mapreduce.tools.CLI.run(CLI.java:304) ~[hadoop-mapreduce-client-core-2.8.5.jar:?]
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.druid.indexing.common.task.HadoopIndexTask$HadoopKillMRJobIdProcessingRunner.runTask(HadoopIndexTask.java:768) ~[druid-indexing-service-0.19.0.jar:0.19.0]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_252]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_252]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_252]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_252]
	at org.apache.druid.indexing.common.task.HadoopIndexTask.killHadoopJob(HadoopIndexTask.java:492) ~[druid-indexing-service-0.19.0.jar:0.19.0]
	at org.apache.druid.indexing.common.task.HadoopIndexTask.lambda$runInternal$0(HadoopIndexTask.java:311) ~[druid-indexing-service-0.19.0.jar:0.19.0]
	at org.apache.druid.indexing.common.task.TaskResourceCleaner.clean(TaskResourceCleaner.java:50) [druid-indexing-service-0.19.0.jar:0.19.0]
	at org.apache.druid.indexing.common.task.AbstractBatchIndexTask.stopGracefully(AbstractBatchIndexTask.java:132) [druid-indexing-service-0.19.0.jar:0.19.0]
	at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner.stop(SingleTaskBackgroundRunner.java:186) [druid-indexing-service-0.19.0.jar:0.19.0]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_252]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_252]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_252]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_252]
	at org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler.stop(Lifecycle.java:465) [druid-core-0.19.0.jar:0.19.0]
	at org.apache.druid.java.util.common.lifecycle.Lifecycle.stop(Lifecycle.java:368) [druid-core-0.19.0.jar:0.19.0]
	at org.apache.druid.cli.CliPeon.run(CliPeon.java:306) [druid-services-0.19.0.jar:0.19.0]
	at org.apache.druid.cli.Main.main(Main.java:113) [druid-services-0.19.0.jar:0.19.0]
```
The Kerberos config works very well for all other Hadoop-related tasks, like
adding segments, fetching from HDFS, etc., and the above error does not happen
if the indexing succeeds.
### Affected Version
0.19.0 (upgrading from 0.12.3)
Important note: we still run a CDH 5.16 Hadoop distribution. It works well with
`-Dhadoop.mapreduce.job.classloader=true` without rebuilding Druid with
different dependencies.
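For context, the equivalent classloader isolation setting can also be supplied per task instead of as a JVM flag, via `jobProperties` in the Hadoop ingestion spec's `tuningConfig` (a minimal sketch; all fields other than `jobProperties` are placeholders, not our actual spec):

```json
{
  "type": "index_hadoop",
  "spec": {
    "tuningConfig": {
      "type": "hadoop",
      "jobProperties": {
        "mapreduce.job.classloader": "true"
      }
    }
  }
}
```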