[ 
https://issues.apache.org/jira/browse/HDDS-609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16646970#comment-16646970
 ] 

Hanisha Koneru commented on HDDS-609:
-------------------------------------

Initial error logs reported by [~nmaheshwari] (moved from the description):
{code:java}
-bash-4.2$ /usr/hdp/current/hadoop-client/bin/hadoop jar 
/usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar 
wordcount /tmp/mr_jobs/input/ o3://bucket2.volume2/mr_job5
18/10/09 23:37:07 INFO conf.Configuration: Removed undeclared tags:
18/10/09 23:37:08 INFO conf.Configuration: Removed undeclared tags:
18/10/09 23:37:08 INFO conf.Configuration: Removed undeclared tags:
18/10/09 23:37:08 INFO client.AHSProxy: Connecting to Application History 
server at ctr-e138-1518143905142-510793-01-000004.hwx.site/172.27.79.197:10200
18/10/09 23:37:08 INFO client.ConfiguredRMFailoverProxyProvider: Failing over 
to rm2
18/10/09 23:37:09 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding 
for path: /user/hdfs/.staging/job_1539125785626_0007
18/10/09 23:37:09 INFO input.FileInputFormat: Total input files to process : 1
18/10/09 23:37:09 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
18/10/09 23:37:09 INFO lzo.LzoCodec: Successfully loaded & initialized 
native-lzo library [hadoop-lzo rev 5d6248d8d690f8456469979213ab2e9993bfa2e9]
18/10/09 23:37:09 INFO mapreduce.JobSubmitter: number of splits:1
18/10/09 23:37:09 INFO mapreduce.JobSubmitter: Submitting tokens for job: 
job_1539125785626_0007
18/10/09 23:37:09 INFO mapreduce.JobSubmitter: Executing with tokens: []
18/10/09 23:37:10 INFO conf.Configuration: Removed undeclared tags:
18/10/09 23:37:10 INFO conf.Configuration: found resource resource-types.xml at 
file:/etc/hadoop/3.0.3.0-63/0/resource-types.xml
18/10/09 23:37:10 INFO conf.Configuration: Removed undeclared tags:
18/10/09 23:37:10 INFO impl.YarnClientImpl: Submitted application 
application_1539125785626_0007
18/10/09 23:37:10 INFO mapreduce.Job: The url to track the job: 
http://ctr-e138-1518143905142-510793-01-000005.hwx.site:8088/proxy/application_1539125785626_0007/
18/10/09 23:37:10 INFO mapreduce.Job: Running job: job_1539125785626_0007
18/10/09 23:37:17 INFO mapreduce.Job: Job job_1539125785626_0007 running in 
uber mode : false
18/10/09 23:37:17 INFO mapreduce.Job: map 0% reduce 0%
18/10/09 23:37:24 INFO mapreduce.Job: map 100% reduce 0%
18/10/09 23:37:29 INFO mapreduce.Job: Task Id : 
attempt_1539125785626_0007_r_000000_0, Status : FAILED
Error: java.io.IOException: Allocate block failed, error:INTERNAL_ERROR
at 
org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.allocateBlock(OzoneManagerProtocolClientSideTranslatorPB.java:576)
at 
org.apache.hadoop.ozone.client.io.ChunkGroupOutputStream.allocateNewBlock(ChunkGroupOutputStream.java:475)
at 
org.apache.hadoop.ozone.client.io.ChunkGroupOutputStream.handleWrite(ChunkGroupOutputStream.java:271)
at 
org.apache.hadoop.ozone.client.io.ChunkGroupOutputStream.write(ChunkGroupOutputStream.java:250)
at 
org.apache.hadoop.fs.ozone.OzoneFSOutputStream.write(OzoneFSOutputStream.java:47)
at 
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:57)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at 
org.apache.hadoop.mapreduce.lib.output.TextOutputFormat$LineRecordWriter.writeObject(TextOutputFormat.java:78)
at 
org.apache.hadoop.mapreduce.lib.output.TextOutputFormat$LineRecordWriter.write(TextOutputFormat.java:93)
at 
org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:559)
at 
org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at 
org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.write(WrappedReducer.java:105)
at org.apache.hadoop.examples.WordCount$IntSumReducer.reduce(WordCount.java:64)
at org.apache.hadoop.examples.WordCount$IntSumReducer.reduce(WordCount.java:52)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:628)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)

18/10/09 23:37:35 INFO mapreduce.Job: Task Id : 
attempt_1539125785626_0007_r_000000_1, Status : FAILED
Error: java.io.IOException: Allocate block failed, error:INTERNAL_ERROR
at 
org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.allocateBlock(OzoneManagerProtocolClientSideTranslatorPB.java:576)
at 
org.apache.hadoop.ozone.client.io.ChunkGroupOutputStream.allocateNewBlock(ChunkGroupOutputStream.java:475)
at 
org.apache.hadoop.ozone.client.io.ChunkGroupOutputStream.handleWrite(ChunkGroupOutputStream.java:271)
at 
org.apache.hadoop.ozone.client.io.ChunkGroupOutputStream.write(ChunkGroupOutputStream.java:250)
at 
org.apache.hadoop.fs.ozone.OzoneFSOutputStream.write(OzoneFSOutputStream.java:47)
at 
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:57)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at 
org.apache.hadoop.mapreduce.lib.output.TextOutputFormat$LineRecordWriter.writeObject(TextOutputFormat.java:78)
at 
org.apache.hadoop.mapreduce.lib.output.TextOutputFormat$LineRecordWriter.write(TextOutputFormat.java:93)
at 
org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:559)
at 
org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at 
org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.write(WrappedReducer.java:105)
at org.apache.hadoop.examples.WordCount$IntSumReducer.reduce(WordCount.java:64)
at org.apache.hadoop.examples.WordCount$IntSumReducer.reduce(WordCount.java:52)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:628)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)

18/10/09 23:37:42 INFO mapreduce.Job: Task Id : 
attempt_1539125785626_0007_r_000000_2, Status : FAILED
Error: java.io.IOException: Allocate block failed, error:INTERNAL_ERROR
at 
org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.allocateBlock(OzoneManagerProtocolClientSideTranslatorPB.java:576)
at 
org.apache.hadoop.ozone.client.io.ChunkGroupOutputStream.allocateNewBlock(ChunkGroupOutputStream.java:475)
at 
org.apache.hadoop.ozone.client.io.ChunkGroupOutputStream.handleWrite(ChunkGroupOutputStream.java:271)
at 
org.apache.hadoop.ozone.client.io.ChunkGroupOutputStream.write(ChunkGroupOutputStream.java:250)
at 
org.apache.hadoop.fs.ozone.OzoneFSOutputStream.write(OzoneFSOutputStream.java:47)
at 
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:57)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at 
org.apache.hadoop.mapreduce.lib.output.TextOutputFormat$LineRecordWriter.writeObject(TextOutputFormat.java:78)
at 
org.apache.hadoop.mapreduce.lib.output.TextOutputFormat$LineRecordWriter.write(TextOutputFormat.java:93)
at 
org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:559)
at 
org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at 
org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.write(WrappedReducer.java:105)
at org.apache.hadoop.examples.WordCount$IntSumReducer.reduce(WordCount.java:64)
at org.apache.hadoop.examples.WordCount$IntSumReducer.reduce(WordCount.java:52)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:628)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)

18/10/09 23:37:49 INFO mapreduce.Job: map 100% reduce 100%
18/10/09 23:37:50 INFO mapreduce.Job: Job job_1539125785626_0007 failed with 
state FAILED due to: Task failed task_1539125785626_0007_r_000000
Job failed as tasks failed. failedMaps:0 failedReduces:1 killedMaps:0 
killedReduces: 0

18/10/09 23:37:50 INFO conf.Configuration: Removed undeclared tags:
18/10/09 23:37:51 INFO mapreduce.Job: Counters: 45
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=266505
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=215876
HDFS: Number of bytes written=0
HDFS: Number of read operations=2
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
O3: Number of bytes read=0
O3: Number of bytes written=0
O3: Number of read operations=0
O3: Number of large read operations=0
O3: Number of write operations=0
Job Counters
Failed reduce tasks=4
Launched map tasks=1
Launched reduce tasks=4
Rack-local map tasks=1
Total time spent by all maps in occupied slots (ms)=18816
Total time spent by all reduces in occupied slots (ms)=128016
Total time spent by all map tasks (ms)=4704
Total time spent by all reduce tasks (ms)=16002
Total vcore-milliseconds taken by all map tasks=4704
Total vcore-milliseconds taken by all reduce tasks=16002
Total megabyte-milliseconds taken by all map tasks=19267584
Total megabyte-milliseconds taken by all reduce tasks=131088384
Map-Reduce Framework
Map input records=716
Map output records=32019
Map output bytes=343475
Map output materialized bytes=6332
Input split bytes=121
Combine input records=32019
Combine output records=461
Spilled Records=461
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=96
CPU time spent (ms)=3300
Physical memory (bytes) snapshot=2528358400
Virtual memory (bytes) snapshot=5420421120
Total committed heap usage (bytes)=2714238976
Peak Map Physical memory (bytes)=2528358400
Peak Map Virtual memory (bytes)=5420421120
File Input Format Counters
Bytes Read=215755
18/10/09 23:37:51 INFO conf.Configuration: Removed undeclared tags:
{code}
SCM logs
{code:java}
2018-10-09 23:37:28,984 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 
on 9863, call Call#101 Retry#0 
org.apache.hadoop.ozone.protocol.ScmBlockLocationProtocol.allocateScmBlock from 
172.27.56.9:33814
org.apache.hadoop.hdds.scm.exceptions.SCMException: ChillModePrecheck failed 
for allocateBlock
at 
org.apache.hadoop.hdds.scm.server.ChillModePrecheck.check(ChillModePrecheck.java:38)
at 
org.apache.hadoop.hdds.scm.server.ChillModePrecheck.check(ChillModePrecheck.java:30)
at org.apache.hadoop.hdds.scm.ScmUtils.preCheck(ScmUtils.java:42)
at 
org.apache.hadoop.hdds.scm.block.BlockManagerImpl.allocateBlock(BlockManagerImpl.java:191)
at 
org.apache.hadoop.hdds.scm.server.SCMBlockProtocolServer.allocateBlock(SCMBlockProtocolServer.java:143)
at 
org.apache.hadoop.ozone.protocolPB.ScmBlockLocationProtocolServerSideTranslatorPB.allocateScmBlock(ScmBlockLocationProtocolServerSideTranslatorPB.java:74)
at 
org.apache.hadoop.hdds.protocol.proto.ScmBlockLocationProtocolProtos$ScmBlockLocationProtocolService$2.callBlockingMethod(ScmBlockLocationProtocolProtos.java:6255)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
2018-10-09 23:37:35,232 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 
on 9863, call Call#103 Retry#0 
org.apache.hadoop.ozone.protocol.ScmBlockLocationProtocol.allocateScmBlock from 
172.27.56.9:33814
org.apache.hadoop.hdds.scm.exceptions.SCMException: ChillModePrecheck failed 
for allocateBlock
at 
org.apache.hadoop.hdds.scm.server.ChillModePrecheck.check(ChillModePrecheck.java:38)
at 
org.apache.hadoop.hdds.scm.server.ChillModePrecheck.check(ChillModePrecheck.java:30)
at org.apache.hadoop.hdds.scm.ScmUtils.preCheck(ScmUtils.java:42)
at 
org.apache.hadoop.hdds.scm.block.BlockManagerImpl.allocateBlock(BlockManagerImpl.java:191)
at 
org.apache.hadoop.hdds.scm.server.SCMBlockProtocolServer.allocateBlock(SCMBlockProtocolServer.java:143)
at 
org.apache.hadoop.ozone.protocolPB.ScmBlockLocationProtocolServerSideTranslatorPB.allocateScmBlock(ScmBlockLocationProtocolServerSideTranslatorPB.java:74)
at 
org.apache.hadoop.hdds.protocol.proto.ScmBlockLocationProtocolProtos$ScmBlockLocationProtocolService$2.callBlockingMethod(ScmBlockLocationProtocolProtos.java:6255)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
2018-10-09 23:37:42,044 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 
on 9863, call Call#105 Retry#0 
org.apache.hadoop.ozone.protocol.ScmBlockLocationProtocol.allocateScmBlock from 
172.27.56.9:33814
org.apache.hadoop.hdds.scm.exceptions.SCMException: ChillModePrecheck failed 
for allocateBlock
at 
org.apache.hadoop.hdds.scm.server.ChillModePrecheck.check(ChillModePrecheck.java:38)
at 
org.apache.hadoop.hdds.scm.server.ChillModePrecheck.check(ChillModePrecheck.java:30)
at org.apache.hadoop.hdds.scm.ScmUtils.preCheck(ScmUtils.java:42)
at 
org.apache.hadoop.hdds.scm.block.BlockManagerImpl.allocateBlock(BlockManagerImpl.java:191)
at 
org.apache.hadoop.hdds.scm.server.SCMBlockProtocolServer.allocateBlock(SCMBlockProtocolServer.java:143)
at 
org.apache.hadoop.ozone.protocolPB.ScmBlockLocationProtocolServerSideTranslatorPB.allocateScmBlock(ScmBlockLocationProtocolServerSideTranslatorPB.java:74)
at 
org.apache.hadoop.hdds.protocol.proto.ScmBlockLocationProtocolProtos$ScmBlockLocationProtocolService$2.callBlockingMethod(ScmBlockLocationProtocolProtos.java:6255)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
2018-10-09 23:37:48,656 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 
on 9863, call Call#107 Retry#0 
org.apache.hadoop.ozone.protocol.ScmBlockLocationProtocol.allocateScmBlock from 
172.27.56.9:33814
org.apache.hadoop.hdds.scm.exceptions.SCMException: ChillModePrecheck failed 
for allocateBlock
at 
org.apache.hadoop.hdds.scm.server.ChillModePrecheck.check(ChillModePrecheck.java:38)
at 
org.apache.hadoop.hdds.scm.server.ChillModePrecheck.check(ChillModePrecheck.java:30)
at org.apache.hadoop.hdds.scm.ScmUtils.preCheck(ScmUtils.java:42)
at 
org.apache.hadoop.hdds.scm.block.BlockManagerImpl.allocateBlock(BlockManagerImpl.java:191)
at 
org.apache.hadoop.hdds.scm.server.SCMBlockProtocolServer.allocateBlock(SCMBlockProtocolServer.java:143)
at 
org.apache.hadoop.ozone.protocolPB.ScmBlockLocationProtocolServerSideTranslatorPB.allocateScmBlock(ScmBlockLocationProtocolServerSideTranslatorPB.java:74)
at 
org.apache.hadoop.hdds.protocol.proto.ScmBlockLocationProtocolProtos$ScmBlockLocationProtocolService$2.callBlockingMethod(ScmBlockLocationProtocolProtos.java:6255)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
{code}

> On restart, SCM does not exit chill mode as it expects DNs to report 
> containers in ALLOCATED state
> --------------------------------------------------------------------------------------------------
>
>                 Key: HDDS-609
>                 URL: https://issues.apache.org/jira/browse/HDDS-609
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>            Reporter: Namit Maheshwari
>            Assignee: Hanisha Koneru
>            Priority: Major
>
> Note: Updated the description to describe the root cause of the bug and moved 
> the error logs to comments.
> On restart, SCM can exit chill mode only if it receives report of 99% 
> (default) of containers from the DNs. 
> SCM includes containers in ALLOCATED state in calculating the total number of 
> containers. But since ALLOCATED containers are not reported by DNs, the 
> calculation of percentage of reported containers is misconstrued.
> {code:java}
> For example, say we have 1DN in the cluster and we restart SCM.
> Total number of containers in SCM ContainerMap = 20
> Containers in OPEN state = 2
> Containers in ALLOCATED state = 18
> Containers reported by DN on SCM restart = 2 
> Fraction of reported containers as calculated by SCMChillNodeManager = (2/20) 
> = 0.10
>  {code}
> We should not include the ALLOCATED containers while calculating the total 
> number of containers for chill mode exit rule. Otherwise, for scenarios such 
> as above, SCM can never come out of chill mode.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

Reply via email to