elukey opened a new issue #10209:
URL: https://github.com/apache/druid/issues/10209
Hi everybody,

I am currently testing Druid 0.19.0; I'd like to upgrade the two clusters
that we run in production from 0.12.3 to 0.19.0. So far all my tests look good
(really nice job!), but I noticed odd behavior when an indexing task kicks
off a map-reduce job that fails for some reason. I get the following logs
repeated over and over, and eventually the middle manager stops:
```
2020-07-24T06:44:34,700 INFO org.apache.druid.indexer.JobHelper: Deleting path[/tmp/druid-indexing/test_webrequest_new_hadoop/2020-07-24T064311.660Z_5d3b1f48fbc244be97c7dbe42c49bef5]
2020-07-24T06:44:34,999 INFO org.apache.druid.indexing.worker.executor.ExecutorLifecycle: Task completed with status: {
  "id" : "index_hadoop_test_webrequest_new_hadoop_cldagnim_2020-07-24T06:43:11.653Z",
  "status" : "FAILED",
  "duration" : 74867,
  "errorMsg" : "{\"attempt_1594733405064_2112_m_000009_1\":\"Error: org.apache.druid.java.util.common.RE: Failure on ro...",
  "location" : {
    "host" : null,
    "port" : -1,
    "tlsPort" : -1
  }
}
2020-07-24T06:44:35,008 INFO org.apache.druid.java.util.common.lifecycle.Lifecycle: Stopping lifecycle [module] stage [ANNOUNCEMENTS]
2020-07-24T06:44:35,011 INFO org.apache.druid.java.util.common.lifecycle.Lifecycle: Stopping lifecycle [module] stage [SERVER]
2020-07-24T06:44:35,016 INFO org.eclipse.jetty.server.AbstractConnector: Stopped ServerConnector@36f40d72{HTTP/1.1,[http/1.1]}{0.0.0.0:8200}
2020-07-24T06:44:35,016 INFO org.eclipse.jetty.server.session: node0 Stopped scavenging
2020-07-24T06:44:35,019 INFO org.eclipse.jetty.server.handler.ContextHandler: Stopped o.e.j.s.ServletContextHandler@3bc18fec{/,null,UNAVAILABLE}
2020-07-24T06:44:35,021 INFO org.apache.druid.java.util.common.lifecycle.Lifecycle: Stopping lifecycle [module] stage [NORMAL]
2020-07-24T06:44:35,022 INFO org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner: Starting graceful shutdown of task[index_hadoop_test_webrequest_new_hadoop_cldagnim_2020-07-24T06:43:11.653Z].
[..]
2020-07-24T06:44:39,457 WARN org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
2020-07-24T06:44:39,464 INFO org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider: Failing over to analytics1029-eqiad-wmnet
2020-07-24T06:44:39,466 WARN org.apache.hadoop.ipc.Client: Failed to connect to server: analytics1029.eqiad.wmnet/10.64.36.129:8032: retries get failed due to exceeded maximum allowed retries number: 0
java.net.ConnectException: Connection refused
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:1.8.0_252]
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:714) ~[?:1.8.0_252]
	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:685) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:788) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.ipc.Client$Connection.access$3500(Client.java:410) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1550) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.ipc.Client.call(Client.java:1381) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.ipc.Client.call(Client.java:1345) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116) ~[hadoop-common-2.8.5.jar:?]
	at com.sun.proxy.$Proxy318.getApplicationReport(Unknown Source) ~[?:?]
	at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationReport(ApplicationClientProtocolPBClientImpl.java:228) ~[hadoop-yarn-common-2.8.5.jar:?]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_252]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_252]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_252]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_252]
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:346) ~[hadoop-common-2.8.5.jar:?]
	at com.sun.proxy.$Proxy319.getApplicationReport(Unknown Source) ~[?:?]
	at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationReport(YarnClientImpl.java:480) ~[hadoop-yarn-client-2.8.5.jar:?]
	at org.apache.hadoop.mapred.ResourceMgrDelegate.getApplicationReport(ResourceMgrDelegate.java:314) ~[hadoop-mapreduce-client-jobclient-2.8.5.jar:?]
	at org.apache.hadoop.mapred.ClientServiceDelegate.getProxy(ClientServiceDelegate.java:155) ~[hadoop-mapreduce-client-jobclient-2.8.5.jar:?]
	at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:324) ~[hadoop-mapreduce-client-jobclient-2.8.5.jar:?]
	at org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:429) ~[hadoop-mapreduce-client-jobclient-2.8.5.jar:?]
	at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:617) ~[hadoop-mapreduce-client-jobclient-2.8.5.jar:?]
	at org.apache.hadoop.mapreduce.Cluster.getJob(Cluster.java:207) ~[hadoop-mapreduce-client-core-2.8.5.jar:?]
	at org.apache.hadoop.mapreduce.tools.CLI.getJob(CLI.java:547) ~[hadoop-mapreduce-client-core-2.8.5.jar:?]
	at org.apache.hadoop.mapreduce.tools.CLI.run(CLI.java:304) ~[hadoop-mapreduce-client-core-2.8.5.jar:?]
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90) ~[hadoop-common-2.8.5.jar:?]
	at org.apache.druid.indexing.common.task.HadoopIndexTask$HadoopKillMRJobIdProcessingRunner.runTask(HadoopIndexTask.java:768) ~[druid-indexing-service-0.19.0.jar:0.19.0]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_252]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_252]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_252]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_252]
	at org.apache.druid.indexing.common.task.HadoopIndexTask.killHadoopJob(HadoopIndexTask.java:492) ~[druid-indexing-service-0.19.0.jar:0.19.0]
	at org.apache.druid.indexing.common.task.HadoopIndexTask.lambda$runInternal$0(HadoopIndexTask.java:311) ~[druid-indexing-service-0.19.0.jar:0.19.0]
	at org.apache.druid.indexing.common.task.TaskResourceCleaner.clean(TaskResourceCleaner.java:50) [druid-indexing-service-0.19.0.jar:0.19.0]
	at org.apache.druid.indexing.common.task.AbstractBatchIndexTask.stopGracefully(AbstractBatchIndexTask.java:132) [druid-indexing-service-0.19.0.jar:0.19.0]
	at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner.stop(SingleTaskBackgroundRunner.java:186) [druid-indexing-service-0.19.0.jar:0.19.0]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_252]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_252]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_252]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_252]
	at org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler.stop(Lifecycle.java:465) [druid-core-0.19.0.jar:0.19.0]
	at org.apache.druid.java.util.common.lifecycle.Lifecycle.stop(Lifecycle.java:368) [druid-core-0.19.0.jar:0.19.0]
	at org.apache.druid.cli.CliPeon.run(CliPeon.java:306) [druid-services-0.19.0.jar:0.19.0]
	at org.apache.druid.cli.Main.main(Main.java:113) [druid-services-0.19.0.jar:0.19.0]
```
The Kerberos config works very well for all other Hadoop-related tasks, like
adding segments, fetching from HDFS, etc., and the above error does not happen
if the indexing succeeds.
### Affected Version
0.19.0 (upgrading from 0.12.3)
Important note: we still run a CDH 5.16 Hadoop distribution. It works well with
`-Dhadoop.mapreduce.job.classloader=true` without rebuilding Druid with
different dependencies.
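For context, the equivalent classloader isolation setting can also be supplied per task instead of as a JVM flag, via `jobProperties` in the Hadoop ingestion spec's `tuningConfig` (a minimal sketch; all fields other than `jobProperties` are placeholders, not our actual spec):

```json
{
  "type": "index_hadoop",
  "spec": {
    "tuningConfig": {
      "type": "hadoop",
      "jobProperties": {
        "mapreduce.job.classloader": "true"
      }
    }
  }
}
```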