[ https://issues.apache.org/jira/browse/HDDS-609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16646970#comment-16646970 ]
Hanisha Koneru commented on HDDS-609: ------------------------------------- Initial error logs reported by [~nmaheshwari] (moved from the description): {code:java} -bash-4.2$ /usr/hdp/current/hadoop-client/bin/hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar wordcount /tmp/mr_jobs/input/ o3://bucket2.volume2/mr_job5 18/10/09 23:37:07 INFO conf.Configuration: Removed undeclared tags: 18/10/09 23:37:08 INFO conf.Configuration: Removed undeclared tags: 18/10/09 23:37:08 INFO conf.Configuration: Removed undeclared tags: 18/10/09 23:37:08 INFO client.AHSProxy: Connecting to Application History server at ctr-e138-1518143905142-510793-01-000004.hwx.site/172.27.79.197:10200 18/10/09 23:37:08 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2 18/10/09 23:37:09 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/hdfs/.staging/job_1539125785626_0007 18/10/09 23:37:09 INFO input.FileInputFormat: Total input files to process : 1 18/10/09 23:37:09 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library 18/10/09 23:37:09 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 5d6248d8d690f8456469979213ab2e9993bfa2e9] 18/10/09 23:37:09 INFO mapreduce.JobSubmitter: number of splits:1 18/10/09 23:37:09 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1539125785626_0007 18/10/09 23:37:09 INFO mapreduce.JobSubmitter: Executing with tokens: [] 18/10/09 23:37:10 INFO conf.Configuration: Removed undeclared tags: 18/10/09 23:37:10 INFO conf.Configuration: found resource resource-types.xml at file:/etc/hadoop/3.0.3.0-63/0/resource-types.xml 18/10/09 23:37:10 INFO conf.Configuration: Removed undeclared tags: 18/10/09 23:37:10 INFO impl.YarnClientImpl: Submitted application application_1539125785626_0007 18/10/09 23:37:10 INFO mapreduce.Job: The url to track the job: http://ctr-e138-1518143905142-510793-01-000005.hwx.site:8088/proxy/application_1539125785626_0007/ 18/10/09 23:37:10 INFO mapreduce.Job: Running job: job_1539125785626_0007 18/10/09 23:37:17 INFO mapreduce.Job: Job job_1539125785626_0007 running in uber mode : false 18/10/09 23:37:17 INFO mapreduce.Job: map 0% reduce 0% 18/10/09 23:37:24 INFO mapreduce.Job: map 100% reduce 0% 18/10/09 23:37:29 INFO mapreduce.Job: Task Id : attempt_1539125785626_0007_r_000000_0, Status : FAILED Error: java.io.IOException: Allocate block failed, error:INTERNAL_ERROR at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.allocateBlock(OzoneManagerProtocolClientSideTranslatorPB.java:576) at org.apache.hadoop.ozone.client.io.ChunkGroupOutputStream.allocateNewBlock(ChunkGroupOutputStream.java:475) at org.apache.hadoop.ozone.client.io.ChunkGroupOutputStream.handleWrite(ChunkGroupOutputStream.java:271) at org.apache.hadoop.ozone.client.io.ChunkGroupOutputStream.write(ChunkGroupOutputStream.java:250) at org.apache.hadoop.fs.ozone.OzoneFSOutputStream.write(OzoneFSOutputStream.java:47) at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:57) at java.io.DataOutputStream.write(DataOutputStream.java:107) at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat$LineRecordWriter.writeObject(TextOutputFormat.java:78) at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat$LineRecordWriter.write(TextOutputFormat.java:93) at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:559) at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89) at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.write(WrappedReducer.java:105) at org.apache.hadoop.examples.WordCount$IntSumReducer.reduce(WordCount.java:64) at org.apache.hadoop.examples.WordCount$IntSumReducer.reduce(WordCount.java:52) at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171) at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:628) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168) 18/10/09 23:37:35 INFO mapreduce.Job: Task Id : attempt_1539125785626_0007_r_000000_1, Status : FAILED Error: java.io.IOException: Allocate block failed, error:INTERNAL_ERROR at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.allocateBlock(OzoneManagerProtocolClientSideTranslatorPB.java:576) at org.apache.hadoop.ozone.client.io.ChunkGroupOutputStream.allocateNewBlock(ChunkGroupOutputStream.java:475) at org.apache.hadoop.ozone.client.io.ChunkGroupOutputStream.handleWrite(ChunkGroupOutputStream.java:271) at org.apache.hadoop.ozone.client.io.ChunkGroupOutputStream.write(ChunkGroupOutputStream.java:250) at org.apache.hadoop.fs.ozone.OzoneFSOutputStream.write(OzoneFSOutputStream.java:47) at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:57) at java.io.DataOutputStream.write(DataOutputStream.java:107) at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat$LineRecordWriter.writeObject(TextOutputFormat.java:78) at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat$LineRecordWriter.write(TextOutputFormat.java:93) at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:559) at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89) at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.write(WrappedReducer.java:105) at org.apache.hadoop.examples.WordCount$IntSumReducer.reduce(WordCount.java:64) at org.apache.hadoop.examples.WordCount$IntSumReducer.reduce(WordCount.java:52) at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171) at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:628) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168) 18/10/09 23:37:42 INFO mapreduce.Job: Task Id : attempt_1539125785626_0007_r_000000_2, Status : FAILED Error: java.io.IOException: Allocate block failed, error:INTERNAL_ERROR at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.allocateBlock(OzoneManagerProtocolClientSideTranslatorPB.java:576) at org.apache.hadoop.ozone.client.io.ChunkGroupOutputStream.allocateNewBlock(ChunkGroupOutputStream.java:475) at org.apache.hadoop.ozone.client.io.ChunkGroupOutputStream.handleWrite(ChunkGroupOutputStream.java:271) at org.apache.hadoop.ozone.client.io.ChunkGroupOutputStream.write(ChunkGroupOutputStream.java:250) at org.apache.hadoop.fs.ozone.OzoneFSOutputStream.write(OzoneFSOutputStream.java:47) at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:57) at java.io.DataOutputStream.write(DataOutputStream.java:107) at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat$LineRecordWriter.writeObject(TextOutputFormat.java:78) at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat$LineRecordWriter.write(TextOutputFormat.java:93) at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:559) at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89) at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.write(WrappedReducer.java:105) at org.apache.hadoop.examples.WordCount$IntSumReducer.reduce(WordCount.java:64) at org.apache.hadoop.examples.WordCount$IntSumReducer.reduce(WordCount.java:52) at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171) at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:628) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168) 18/10/09 23:37:49 INFO mapreduce.Job: map 100% reduce 100% 18/10/09 23:37:50 INFO mapreduce.Job: Job job_1539125785626_0007 failed with state FAILED due to: Task failed task_1539125785626_0007_r_000000 Job failed as tasks failed. failedMaps:0 failedReduces:1 killedMaps:0 killedReduces: 0 18/10/09 23:37:50 INFO conf.Configuration: Removed undeclared tags: 18/10/09 23:37:51 INFO mapreduce.Job: Counters: 45 File System Counters FILE: Number of bytes read=0 FILE: Number of bytes written=266505 FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 HDFS: Number of bytes read=215876 HDFS: Number of bytes written=0 HDFS: Number of read operations=2 HDFS: Number of large read operations=0 HDFS: Number of write operations=0 O3: Number of bytes read=0 O3: Number of bytes written=0 O3: Number of read operations=0 O3: Number of large read operations=0 O3: Number of write operations=0 Job Counters Failed reduce tasks=4 Launched map tasks=1 Launched reduce tasks=4 Rack-local map tasks=1 Total time spent by all maps in occupied slots (ms)=18816 Total time spent by all reduces in occupied slots (ms)=128016 Total time spent by all map tasks (ms)=4704 Total time spent by all reduce tasks (ms)=16002 Total vcore-milliseconds taken by all map tasks=4704 Total vcore-milliseconds taken by all reduce tasks=16002 Total megabyte-milliseconds taken by all map tasks=19267584 Total megabyte-milliseconds taken by all reduce tasks=131088384 Map-Reduce Framework Map input records=716 Map output records=32019 Map output bytes=343475 Map output materialized bytes=6332 Input split bytes=121 Combine input records=32019 Combine output records=461 Spilled Records=461 Failed Shuffles=0 Merged Map outputs=0 GC time elapsed (ms)=96 CPU time spent (ms)=3300 Physical memory (bytes) snapshot=2528358400 Virtual memory (bytes) snapshot=5420421120 Total committed heap usage (bytes)=2714238976 Peak Map Physical memory (bytes)=2528358400 Peak Map Virtual memory (bytes)=5420421120 File Input Format Counters Bytes Read=215755 18/10/09 23:37:51 INFO conf.Configuration: Removed undeclared tags: {code} SCM logs {code:java} 2018-10-09 23:37:28,984 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 9863, call Call#101 Retry#0 org.apache.hadoop.ozone.protocol.ScmBlockLocationProtocol.allocateScmBlock from 172.27.56.9:33814 org.apache.hadoop.hdds.scm.exceptions.SCMException: ChillModePrecheck failed for allocateBlock at org.apache.hadoop.hdds.scm.server.ChillModePrecheck.check(ChillModePrecheck.java:38) at org.apache.hadoop.hdds.scm.server.ChillModePrecheck.check(ChillModePrecheck.java:30) at org.apache.hadoop.hdds.scm.ScmUtils.preCheck(ScmUtils.java:42) at org.apache.hadoop.hdds.scm.block.BlockManagerImpl.allocateBlock(BlockManagerImpl.java:191) at org.apache.hadoop.hdds.scm.server.SCMBlockProtocolServer.allocateBlock(SCMBlockProtocolServer.java:143) at org.apache.hadoop.ozone.protocolPB.ScmBlockLocationProtocolServerSideTranslatorPB.allocateScmBlock(ScmBlockLocationProtocolServerSideTranslatorPB.java:74) at org.apache.hadoop.hdds.protocol.proto.ScmBlockLocationProtocolProtos$ScmBlockLocationProtocolService$2.callBlockingMethod(ScmBlockLocationProtocolProtos.java:6255) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682) 2018-10-09 23:37:35,232 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 9863, call Call#103 Retry#0 org.apache.hadoop.ozone.protocol.ScmBlockLocationProtocol.allocateScmBlock from 172.27.56.9:33814 org.apache.hadoop.hdds.scm.exceptions.SCMException: ChillModePrecheck failed for allocateBlock at org.apache.hadoop.hdds.scm.server.ChillModePrecheck.check(ChillModePrecheck.java:38) at org.apache.hadoop.hdds.scm.server.ChillModePrecheck.check(ChillModePrecheck.java:30) at org.apache.hadoop.hdds.scm.ScmUtils.preCheck(ScmUtils.java:42) at org.apache.hadoop.hdds.scm.block.BlockManagerImpl.allocateBlock(BlockManagerImpl.java:191) at org.apache.hadoop.hdds.scm.server.SCMBlockProtocolServer.allocateBlock(SCMBlockProtocolServer.java:143) at org.apache.hadoop.ozone.protocolPB.ScmBlockLocationProtocolServerSideTranslatorPB.allocateScmBlock(ScmBlockLocationProtocolServerSideTranslatorPB.java:74) at org.apache.hadoop.hdds.protocol.proto.ScmBlockLocationProtocolProtos$ScmBlockLocationProtocolService$2.callBlockingMethod(ScmBlockLocationProtocolProtos.java:6255) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682) 2018-10-09 23:37:42,044 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 9863, call Call#105 Retry#0 org.apache.hadoop.ozone.protocol.ScmBlockLocationProtocol.allocateScmBlock from 172.27.56.9:33814 org.apache.hadoop.hdds.scm.exceptions.SCMException: ChillModePrecheck failed for allocateBlock at org.apache.hadoop.hdds.scm.server.ChillModePrecheck.check(ChillModePrecheck.java:38) at org.apache.hadoop.hdds.scm.server.ChillModePrecheck.check(ChillModePrecheck.java:30) at org.apache.hadoop.hdds.scm.ScmUtils.preCheck(ScmUtils.java:42) at org.apache.hadoop.hdds.scm.block.BlockManagerImpl.allocateBlock(BlockManagerImpl.java:191) at org.apache.hadoop.hdds.scm.server.SCMBlockProtocolServer.allocateBlock(SCMBlockProtocolServer.java:143) at org.apache.hadoop.ozone.protocolPB.ScmBlockLocationProtocolServerSideTranslatorPB.allocateScmBlock(ScmBlockLocationProtocolServerSideTranslatorPB.java:74) at org.apache.hadoop.hdds.protocol.proto.ScmBlockLocationProtocolProtos$ScmBlockLocationProtocolService$2.callBlockingMethod(ScmBlockLocationProtocolProtos.java:6255) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682) 2018-10-09 23:37:48,656 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 9863, call Call#107 Retry#0 org.apache.hadoop.ozone.protocol.ScmBlockLocationProtocol.allocateScmBlock from 172.27.56.9:33814 org.apache.hadoop.hdds.scm.exceptions.SCMException: ChillModePrecheck failed for allocateBlock at org.apache.hadoop.hdds.scm.server.ChillModePrecheck.check(ChillModePrecheck.java:38) at org.apache.hadoop.hdds.scm.server.ChillModePrecheck.check(ChillModePrecheck.java:30) at org.apache.hadoop.hdds.scm.ScmUtils.preCheck(ScmUtils.java:42) at org.apache.hadoop.hdds.scm.block.BlockManagerImpl.allocateBlock(BlockManagerImpl.java:191) at org.apache.hadoop.hdds.scm.server.SCMBlockProtocolServer.allocateBlock(SCMBlockProtocolServer.java:143) at org.apache.hadoop.ozone.protocolPB.ScmBlockLocationProtocolServerSideTranslatorPB.allocateScmBlock(ScmBlockLocationProtocolServerSideTranslatorPB.java:74) at org.apache.hadoop.hdds.protocol.proto.ScmBlockLocationProtocolProtos$ScmBlockLocationProtocolService$2.callBlockingMethod(ScmBlockLocationProtocolProtos.java:6255) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682) {code} > On restart, SCM does not exit chill mode as it expects DNs to report > containers in ALLOCATED state > -------------------------------------------------------------------------------------------------- > > Key: HDDS-609 > URL: https://issues.apache.org/jira/browse/HDDS-609 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Reporter: Namit Maheshwari > Assignee: Hanisha Koneru > Priority: Major > > Note: Updated the description to describe the root cause of the bug and moved > the error logs to comments. > On restart, SCM can exit chill mode only if it receives report of 99% > (default) of containers from the DNs. > SCM includes containers in ALLOCATED state in calculating the total number of > containers. But since ALLOCATED containers are not reported by DNs, the > calculation of percentage of reported containers is misconstrued. > {code:java} > For example, say we have 1DN in the cluster and we restart SCM. > Total number of containers in SCM ContainerMap = 20 > Containers in OPEN state = 2 > Containers in ALLOCATED state = 18 > Containers reported by DN on SCM restart = 2 > Fraction of reported containers as calculated by SCMChillNodeManager = (2/20) > = 0.10 > {code} > We should not include the ALLOCATED containers while calculating the total > number of containers for chill mode exit rule. Otherwise, for scenarios such > as above, SCM can never come out of chill mode. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org