[ https://issues.apache.org/jira/browse/FLINK-9072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16416926#comment-16416926 ]
liuzhuo commented on FLINK-9072:
--------------------------------

I see that underscores are not allowed in hostnames, but some operators do not observe this rule. It would be better to report an explicit error message for this case; otherwise the problem is very hard to locate.

> Host name with "_" causes cluster exception
> -------------------------------------------
>
>                 Key: FLINK-9072
>                 URL: https://issues.apache.org/jira/browse/FLINK-9072
>             Project: Flink
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.3.2
>         Environment: Linux:
> Linux version 3.10.0-693.2.2.el7.x86_64 (buil...@kbuilder.dev.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Tue Sep 12 22:26:13 UTC 2017
> Java:
> 1.8.0_121-b13
> Flink:
> flink-1.3.2-bin-hadoop26-scala_2.11
>            Reporter: liuzhuo
>            Priority: Critical
>
> In my production environment, when I start the cluster, I get the following errors:
>
> {code:java}
> 2018-03-21 09:50:42,437 ERROR org.apache.flink.runtime.webmonitor.files.StaticFileServerHandler - Caught exception
> akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka://flink/deadLetters), Path(/)]
> 	at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65)
> 	at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:63)
> 	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
> 	at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
> 	at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:73)
> 	at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74)
> 	at akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:120)
> 	at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73)
> 	at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
> 	at scala.concurrent.impl.Promise$DefaultPromise.scala$concurrent$impl$Promise$DefaultPromise$$dispatchOrAddCallback(Promise.scala:280)
> 	at scala.concurrent.impl.Promise$DefaultPromise.onComplete(Promise.scala:270)
> 	at akka.actor.ActorSelection.resolveOne(ActorSelection.scala:63)
> 	at org.apache.flink.runtime.akka.AkkaUtils$.getActorRefFuture(AkkaUtils.scala:498)
> 	at org.apache.flink.runtime.akka.AkkaUtils.getActorRefFuture(AkkaUtils.scala)
> 	at org.apache.flink.runtime.webmonitor.JobManagerRetriever.notifyLeaderAddress(JobManagerRetriever.java:141)
> 	at org.apache.flink.runtime.leaderretrieval.StandaloneLeaderRetrievalService.start(StandaloneLeaderRetrievalService.java:85)
> 	at org.apache.flink.runtime.webmonitor.WebRuntimeMonitor.start(WebRuntimeMonitor.java:434)
> 	at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$startJobManagerActors$6.apply(JobManager.scala:2352)
> 	at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$startJobManagerActors$6.apply(JobManager.scala:2344)
> 	at scala.Option.foreach(Option.scala:257)
> 	at org.apache.flink.runtime.jobmanager.JobManager$.startJobManagerActors(JobManager.scala:2343)
> 	at org.apache.flink.runtime.jobmanager.JobManager$.liftedTree3$1(JobManager.scala:2053)
> 	at org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:2052)
> 	at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply$mcV$sp(JobManager.scala:2139)
> 	at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply(JobManager.scala:2117)
> 	at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$2.apply(JobManager.scala:2117)
> 	at scala.util.Try$.apply(Try.scala:192)
> 	at org.apache.flink.runtime.jobmanager.JobManager$.retryOnBindException(JobManager.scala:2172)
> 	at org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:2117)
> 	at org.apache.flink.runtime.jobmanager.JobManager$$anon$10.call(JobManager.scala:1992)
> 	at org.apache.flink.runtime.jobmanager.JobManager$$anon$10.call(JobManager.scala:1990)
> 	at org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:422)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1656)
> 	at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)
> 	at org.apache.flink.runtime.jobmanager.JobManager$.main(JobManager.scala:1990)
> 	at org.apache.flink.runtime.jobmanager.JobManager.main(JobManager.scala)
> 2018-03-21 09:51:23,993 ERROR org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager - Resource manager could not register at JobManager
> akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka://flink/deadLetters), Path(/)]] after [100000 ms]
> 	at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334)
> 	at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)
> 	at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:599)
> 	at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
> 	at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:597)
> 	at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:474)
> 	at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:425)
> 	at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:429)
> 	at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:381)
> 	at java.lang.Thread.run(Thread.java:748)
> {code}
>
> The error shows "akka://flink/deadLetters". I searched for it on Google; most answers point to the network not working, port 6123 being unavailable, or iptables problems. I ruled out all of the above.
> Finally, I found the difference between the production environment and the development environment.
> In my development environment, the hosts file looks like this:
> 192.168.xx.xx master1
> 192.168.xx.xx slave1
> 192.168.xx.xx slave2
>
> In the production environment, the hosts file looks like this:
> 192.168.xx.xx Flink_master
> 192.168.xx.xx slaves_01
> 192.168.xx.xx slaves_02
>
> When I changed the production hosts to match the development environment, i.e. removed the "_", the cluster went back to normal.
> So I conclude that hostnames containing "_" do not work for a Flink cluster.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
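A minimal sketch of why the underscore breaks actor addressing: Akka actor addresses are URIs, and Akka's address parsing (akka.actor.AddressFromURIString) relies on java.net.URI, which returns a null host when the authority is not a valid RFC 952/1123 hostname, e.g. when it contains "_". With a null host, no valid remote Address can be built, and selections fall back to deadLetters as seen in the log above. The class and helper names below are illustrative (not Flink code), reusing the hostnames from the report; the regex check is the kind of up-front validation with an explicit error message that the comment asks for:

{code:java}
import java.net.URI;

public class HostnameCheck {

    // RFC 1123 label: letters, digits, and hyphens; must not start or end
    // with a hyphen. Underscores are not allowed.
    private static final String LABEL = "[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?";

    // True iff the name is a syntactically valid hostname (dot-separated labels).
    static boolean isValidHostname(String host) {
        return host.matches(LABEL + "(\\." + LABEL + ")*");
    }

    // Host component as java.net.URI recovers it from an actor address;
    // returns null when the authority is not a valid host (e.g. contains '_'),
    // because URI then falls back to a registry-based authority.
    static String parsedHost(String actorAddress) {
        return URI.create(actorAddress).getHost();
    }

    public static void main(String[] args) {
        System.out.println(isValidHostname("master1"));       // true
        System.out.println(isValidHostname("Flink_master"));  // false: contains '_'

        // The development-environment host parses normally ...
        System.out.println(parsedHost("akka.tcp://flink@master1:6123/user/jobmanager"));
        // ... but the production host yields a null host component, so Akka
        // cannot form a usable remote address from it.
        System.out.println(parsedHost("akka.tcp://flink@Flink_master:6123/user/jobmanager"));
    }
}
{code}

A startup script that runs a check like isValidHostname against the configured JobManager/TaskManager hostnames could fail fast with a clear message ("hostname contains '_'") instead of the opaque deadLetters timeout.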