[ https://issues.apache.org/jira/browse/FLINK-10368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16694596#comment-16694596 ]

Till Rohrmann commented on FLINK-10368:
---------------------------------------

The test still seems to be unstable. It failed when run on an AWS instance:
{code}
Successfully built 48a8281421be
Starting Hadoop cluster
Creating network "docker-hadoop-cluster-network" with the default driver
Creating kdc ... done
Creating master ... done
Creating slave2 ... done
Creating slave1 ... done
Waiting for hadoop cluster to come up. We have been trying for 0 seconds, retrying ...
Waiting for hadoop cluster to come up. We have been trying for 10 seconds, retrying ...
Waiting for hadoop cluster to come up. We have been trying for 20 seconds, retrying ...
Waiting for hadoop cluster to come up. We have been trying for 30 seconds, retrying ...
Waiting for hadoop cluster to come up. We have been trying for 41 seconds, retrying ...
Waiting for hadoop cluster to come up. We have been trying for 51 seconds, retrying ...
Waiting for hadoop cluster to come up. We have been trying for 61 seconds, retrying ...
Waiting for hadoop cluster to come up. We have been trying for 71 seconds, retrying ...
Waiting for hadoop cluster to come up. We have been trying for 81 seconds, retrying ...
Waiting for hadoop cluster to come up. We have been trying for 91 seconds, retrying ...
Waiting for hadoop cluster to come up. We have been trying for 101 seconds, retrying ...
Waiting for hadoop cluster to come up. We have been trying for 111 seconds, retrying ...
ERROR: Could not start hadoop cluster. Retrying...
Stopping slave1 ... done
Stopping slave2 ... done
Stopping master ... done
Stopping kdc    ... done
Removing slave1 ... done
Removing slave2 ... done
Removing master ... done
Removing kdc    ... done
Removing network docker-hadoop-cluster-network
Starting Hadoop cluster
Creating network "docker-hadoop-cluster-network" with the default driver
Creating kdc ... done
Creating master ... done
Creating slave2 ... done
Creating slave1 ... done
Waiting for hadoop cluster to come up. We have been trying for 1 seconds, retrying ...
Waiting for hadoop cluster to come up. We have been trying for 11 seconds, retrying ...
Waiting for hadoop cluster to come up. We have been trying for 21 seconds, retrying ...
Waiting for hadoop cluster to come up. We have been trying for 31 seconds, retrying ...
Waiting for hadoop cluster to come up. We have been trying for 41 seconds, retrying ...
Waiting for hadoop cluster to come up. We have been trying for 51 seconds, retrying ...
Waiting for hadoop cluster to come up. We have been trying for 61 seconds, retrying ...
Waiting for hadoop cluster to come up. We have been trying for 71 seconds, retrying ...
Waiting for hadoop cluster to come up. We have been trying for 81 seconds, retrying ...
Waiting for hadoop cluster to come up. We have been trying for 91 seconds, retrying ...
Waiting for hadoop cluster to come up. We have been trying for 101 seconds, retrying ...
Waiting for hadoop cluster to come up. We have been trying for 111 seconds, retrying ...
ERROR: Could not start hadoop cluster. Retrying...
Stopping slave1 ... done
Stopping slave2 ... done
Stopping master ... done
Stopping kdc    ... done
Removing slave1 ... done
Removing slave2 ... done
Removing master ... done
Removing kdc    ... done
Removing network docker-hadoop-cluster-network
Starting Hadoop cluster
Creating network "docker-hadoop-cluster-network" with the default driver
Creating kdc ... done
Creating master ... done
Creating slave2 ... done
Creating slave1 ... done
Waiting for hadoop cluster to come up. We have been trying for 0 seconds, retrying ...
Waiting for hadoop cluster to come up. We have been trying for 10 seconds, retrying ...
Waiting for hadoop cluster to come up. We have been trying for 20 seconds, retrying ...
Waiting for hadoop cluster to come up. We have been trying for 30 seconds, retrying ...
Waiting for hadoop cluster to come up. We have been trying for 41 seconds, retrying ...
Waiting for hadoop cluster to come up. We have been trying for 51 seconds, retrying ...
Waiting for hadoop cluster to come up. We have been trying for 61 seconds, retrying ...
Waiting for hadoop cluster to come up. We have been trying for 71 seconds, retrying ...
Waiting for hadoop cluster to come up. We have been trying for 81 seconds, retrying ...
Waiting for hadoop cluster to come up. We have been trying for 91 seconds, retrying ...
Waiting for hadoop cluster to come up. We have been trying for 101 seconds, retrying ...
Waiting for hadoop cluster to come up. We have been trying for 111 seconds, retrying ...
ERROR: Could not start hadoop cluster. Retrying...
Stopping slave1 ... done
Stopping slave2 ... done
Stopping master ... done
Stopping kdc    ... done
Removing slave1 ... done
Removing slave2 ... done
Removing master ... done
Removing kdc    ... done
Removing network docker-hadoop-cluster-network
ERROR: Could not start hadoop cluster. Aborting...
Removing network docker-hadoop-cluster-network
WARNING: Network docker-hadoop-cluster-network not found.
[FAIL] Test script contains errors.
Checking for errors...
No errors in log files.
Checking for exceptions...
No exceptions in log files.
Checking for non-empty .out files...
grep: /home/admin/flink-1.7.0/log/*.out: No such file or directory
No non-empty .out files.

[FAIL] 'Running Kerberized YARN on Docker test ' failed after 15 minutes and 33 seconds! Test exited with exit code 1
{code}
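
One thing that would make runs like this easier to debug is dumping the container logs before the cluster is torn down; right now all we get is "Could not start hadoop cluster". A minimal sketch of what I mean (the {{start_hadoop_cluster}} function and the container names are assumptions based on the log output above; the actual names in the test script may differ):
{code}
# Sketch only: dump per-container logs when startup fails, before the
# containers are removed, so a failed run leaves something to debug with.
dump_container_logs() {
    for container in kdc master slave1 slave2; do
        echo "===== logs for ${container} ====="
        docker logs "${container}" 2>&1 | tail -n 200
    done
}

if ! start_hadoop_cluster; then
    dump_container_logs
    exit 1
fi
{code}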

> 'Kerberized YARN on Docker test' unstable
> -----------------------------------------
>
>                 Key: FLINK-10368
>                 URL: https://issues.apache.org/jira/browse/FLINK-10368
>             Project: Flink
>          Issue Type: Bug
>          Components: Tests
>    Affects Versions: 1.5.3, 1.6.0, 1.7.0
>            Reporter: Till Rohrmann
>            Assignee: Dawid Wysakowicz
>            Priority: Critical
>              Labels: pull-request-available, test-stability
>             Fix For: 1.5.6, 1.6.3, 1.7.0
>
>
> The 'Running Kerberized YARN on Docker test' end-to-end test failed on an AWS instance. The problem seems to be that the NameNode went into safe mode due to limited resources.
> {code}
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/home/hadoop-user/flink-1.6.1/lib/slf4j-log4j12-1.7.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.8.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 2018-09-19 09:04:39,201 INFO  org.apache.hadoop.security.UserGroupInformation - Login successful for user hadoop-user using keytab file /home/hadoop-user/hadoop-user.keytab
> 2018-09-19 09:04:39,453 INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at master.docker-hadoop-cluster-network/172.22.0.3:8032
> 2018-09-19 09:04:39,640 INFO  org.apache.hadoop.yarn.client.AHSProxy - Connecting to Application History server at master.docker-hadoop-cluster-network/172.22.0.3:10200
> 2018-09-19 09:04:39,656 INFO  org.apache.flink.yarn.cli.FlinkYarnSessionCli - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
> 2018-09-19 09:04:39,656 INFO  org.apache.flink.yarn.cli.FlinkYarnSessionCli - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
> 2018-09-19 09:04:39,901 INFO  org.apache.flink.yarn.AbstractYarnClusterDescriptor - Cluster specification: ClusterSpecification{masterMemoryMB=2000, taskManagerMemoryMB=2000, numberTaskManagers=3, slotsPerTaskManager=1}
> 2018-09-19 09:04:40,286 WARN  org.apache.flink.yarn.AbstractYarnClusterDescriptor - The configuration directory ('/home/hadoop-user/flink-1.6.1/conf') contains both LOG4J and Logback configuration files. Please delete or rename one of them.
> ------------------------------------------------------------
>  The program finished with the following exception:
> org.apache.flink.client.deployment.ClusterDeploymentException: Couldn't deploy Yarn session cluster
>         at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploySessionCluster(AbstractYarnClusterDescriptor.java:420)
>         at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:259)
>         at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:215)
>         at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1044)
>         at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1120)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
>         at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>         at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1120)
> Caused by: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot create file/user/hadoop-user/.flink/application_1537266361291_0099/lib/slf4j-log4j12-1.7.7.jar. Name node is in safe mode.
> Resources are low on NN. Please add or free up more resources then turn off safe mode manually. NOTE:  If you turn off safe mode before adding resources, the NN will immediately return to safe mode. Use "hdfs dfsadmin -safemode leave" to turn safe mode off. 
> NamenodeHostName:master.docker-hadoop-cluster-network
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.newSafemodeException(FSNamesystem.java:1407)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1395)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2278)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2223)
>         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:728)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:413)
>         at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:850)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:793)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1840)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2489)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>         at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121)
>         at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:88)
>         at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:270)
>         at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1274)
>         at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1216)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:473)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:470)
>         at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:470)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:411)
>         at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929)
>         at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910)
>         at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807)
>         at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:368)
>         at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:341)
>         at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2002)
>         at org.apache.flink.yarn.Utils.setupLocalResource(Utils.java:162)
>         at org.apache.flink.yarn.AbstractYarnClusterDescriptor.setupSingleLocalResource(AbstractYarnClusterDescriptor.java:1139)
>         at org.apache.flink.yarn.AbstractYarnClusterDescriptor.access$000(AbstractYarnClusterDescriptor.java:111)
>         at org.apache.flink.yarn.AbstractYarnClusterDescriptor$1.visitFile(AbstractYarnClusterDescriptor.java:1200)
>         at org.apache.flink.yarn.AbstractYarnClusterDescriptor$1.visitFile(AbstractYarnClusterDescriptor.java:1188)
>         at java.nio.file.Files.walkFileTree(Files.java:2670)
>         at java.nio.file.Files.walkFileTree(Files.java:2742)
>         at org.apache.flink.yarn.AbstractYarnClusterDescriptor.uploadAndRegisterFiles(AbstractYarnClusterDescriptor.java:1188)
>         at org.apache.flink.yarn.AbstractYarnClusterDescriptor.startAppMaster(AbstractYarnClusterDescriptor.java:800)
>         at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deployInternal(AbstractYarnClusterDescriptor.java:542)
>         at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploySessionCluster(AbstractYarnClusterDescriptor.java:413)
>         ... 9 more
> Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot create file/user/hadoop-user/.flink/application_1537266361291_0099/lib/slf4j-log4j12-1.7.7.jar. Name node is in safe mode.
> Resources are low on NN. Please add or free up more resources then turn off safe mode manually. NOTE:  If you turn off safe mode before adding resources, the NN will immediately return to safe mode. Use "hdfs dfsadmin -safemode leave" to turn safe mode off. 
> NamenodeHostName:master.docker-hadoop-cluster-network
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.newSafemodeException(FSNamesystem.java:1407)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1395)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2278)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2223)
>         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:728)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:413)
>         at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:850)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:793)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1840)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2489)
>         at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1489)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1435)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1345)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
>         at com.sun.proxy.$Proxy14.create(Unknown Source)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:297)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:346)
>         at com.sun.proxy.$Proxy15.create(Unknown Source)
>         at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:265)
>         ... 33 more
> Running the Flink job failed, might be that the cluster is not ready yet. We have been trying for 795 seconds, retrying ...
> {code}
> I think it would be good to harden the test.
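> Since the error message itself points at safe mode, one hardening option could be to block until the NameNode has left safe mode before submitting the job. A rough sketch (assuming the command is run inside the {{master}} container as a user with valid HDFS credentials):
> {code}
> # "hdfs dfsadmin -safemode wait" blocks until the NameNode leaves safe mode.
> # Guard it with a timeout in case the NN never recovers on its own.
> docker exec master bash -c "timeout 120 hdfs dfsadmin -safemode wait"
> {code}
> Note that if the NameNode is genuinely low on disk, it will re-enter safe mode immediately after being forced out, so freeing up disk space on the instance would also be needed.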



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
