Yiqun Lin created HDDS-2972:
-------------------------------

             Summary: Any container replication error can terminate SCM service
                 Key: HDDS-2972
                 URL: https://issues.apache.org/jira/browse/HDDS-2972
             Project: Hadoop Distributed Data Store
          Issue Type: Improvement
          Components: SCM
    Affects Versions: 0.4.1
            Reporter: Yiqun Lin
            Assignee: Yiqun Lin

I found that any container replication error raised while ReplicationManager is running can terminate the SCM service. Terminating SCM because of a single container replication error is very expensive, and it is not worth shutting SCM down for that. We can handle this more gracefully: catch the exception and log a warning message that includes the thrown exception.

The shutdown info:
{noformat}
2020-01-30 08:16:04,705 ERROR org.apache.hadoop.hdds.scm.container.ReplicationManager: Exception in Replication Monitor Thread.
java.lang.IllegalArgumentException: Affinity node /dc1/rack1/b9343ca0-a4bc-4436-9671-bc1de6c8bd89 is not a member of topology
        at org.apache.hadoop.hdds.scm.net.NetworkTopologyImpl.checkAffinityNode(NetworkTopologyImpl.java:789)
        at org.apache.hadoop.hdds.scm.net.NetworkTopologyImpl.chooseRandom(NetworkTopologyImpl.java:399)
        at org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseNode(SCMContainerPlacementRackAware.java:249)
        at org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseDatanodes(SCMContainerPlacementRackAware.java:173)
        at org.apache.hadoop.hdds.scm.container.ReplicationManager.handleUnderReplicatedContainer(ReplicationManager.java:515)
        at org.apache.hadoop.hdds.scm.container.ReplicationManager.processContainer(ReplicationManager.java:311)
        at java.util.concurrent.ConcurrentHashMap$KeySetView.forEach(ConcurrentHashMap.java:4649)
        at java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1080)
        at org.apache.hadoop.hdds.scm.container.ReplicationManager.run(ReplicationManager.java:223)
        at java.lang.Thread.run(Thread.java:745)
2020-01-30 08:16:04,730 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1: java.lang.IllegalArgumentException: Affinity node /dc1/rack1/b9343ca0-a4bc-4436-9671-bc1de6c8bd89 is not a member of topology
2020-01-30 08:16:04,734 INFO org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter: SHUTDOWN_MSG:
{noformat}
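
The handling proposed in the description (catch the exception and log a warning rather than exit) could look roughly like the sketch below. This is not the actual ReplicationManager code: the class `ReplicationLoopSketch`, the method `processAll`, and the use of plain `Long` container IDs are hypothetical names chosen for illustration, and `System.err` stands in for the real logger. The idea is only that the try/catch sits inside the per-container loop, so one failing container does not abort the whole pass.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical sketch of per-container error isolation; not the real SCM code. */
final class ReplicationLoopSketch {

  /**
   * Runs processContainer for every ID. A RuntimeException from one
   * container is logged as a warning and recorded, and the loop continues
   * with the next container instead of terminating the service.
   *
   * @return the IDs of the containers whose processing failed
   */
  static List<Long> processAll(List<Long> containerIds,
                               Consumer<Long> processContainer) {
    List<Long> failed = new ArrayList<>();
    for (Long id : containerIds) {
      try {
        processContainer.accept(id);
      } catch (RuntimeException e) {
        // Stand-in for LOG.warn(...): report the failure and keep going.
        System.err.println("WARN Failed to process container " + id + ": " + e);
        failed.add(id);
      }
    }
    return failed;
  }

  public static void main(String[] args) {
    List<Long> processed = new ArrayList<>();
    // Container 2 simulates the placement failure seen in the stack trace.
    List<Long> failed = processAll(Arrays.asList(1L, 2L, 3L), id -> {
      if (id == 2L) {
        throw new IllegalArgumentException(
            "Affinity node is not a member of topology");
      }
      processed.add(id);
    });
    System.out.println("processed=" + processed + " failed=" + failed);
  }
}
```

Only `RuntimeException` is caught in this sketch, so an `Error` such as `OutOfMemoryError` would still propagate. Whether the real fix should catch more broadly, or retry the failed containers on the next pass, is a design choice left open here.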