Yiqun Lin created HDDS-2972:
-------------------------------

             Summary: Any container replication error can terminate the SCM service
                 Key: HDDS-2972
                 URL: https://issues.apache.org/jira/browse/HDDS-2972
             Project: Hadoop Distributed Data Store
          Issue Type: Improvement
          Components: SCM
    Affects Versions: 0.4.1
            Reporter: Yiqun Lin
            Assignee: Yiqun Lin


I found that any container replication error thrown in ReplicationManager can
terminate the SCM service. Terminating the SCM service because of a single
container replication error is very expensive behavior.

It is not worth shutting down the SCM for this. We can handle it more
gracefully: catch the exception and log a warning message that includes the
thrown exception.
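
As a rough illustration of the proposed handling, here is a minimal, self-contained sketch. It is not the actual ReplicationManager code: the class name ReplicationLoopSketch, the processAll helper, and the use of plain Long container IDs are all hypothetical, and a real fix would log through the project's logger rather than System.err. It only shows the idea of isolating each container's processing so that one failure is logged and skipped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

public class ReplicationLoopSketch {

    /**
     * Processes every container, isolating failures so that one bad
     * container is logged and skipped instead of ending the whole run.
     * Returns the IDs whose processing threw an exception.
     */
    static List<Long> processAll(List<Long> containerIds, Consumer<Long> processor) {
        List<Long> failed = new ArrayList<>();
        for (Long id : containerIds) {
            try {
                processor.accept(id);
            } catch (RuntimeException e) {
                // Warn and continue rather than exiting the process.
                System.err.println("WARN: failed to process container " + id + ": " + e);
                failed.add(id);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        List<Long> processed = new ArrayList<>();
        // Hypothetical processor: container 2 hits the topology error.
        Consumer<Long> processor = id -> {
            if (id == 2L) {
                throw new IllegalArgumentException(
                    "Affinity node is not a member of topology");
            }
            processed.add(id);
        };
        List<Long> failed = processAll(Arrays.asList(1L, 2L, 3L), processor);
        // Containers 1 and 3 are still processed despite the failure on 2.
        System.out.println("processed=" + processed + " failed=" + failed);
    }
}
```

Catching per container, rather than around the whole monitoring cycle, lets the remaining containers still be handled in the same pass. Which granularity suits ReplicationManager is a design choice for the patch.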

The shutdown info:
{noformat}
2020-01-30 08:16:04,705 ERROR org.apache.hadoop.hdds.scm.container.ReplicationManager: Exception in Replication Monitor Thread.
java.lang.IllegalArgumentException: Affinity node /dc1/rack1/b9343ca0-a4bc-4436-9671-bc1de6c8bd89 is not a member of topology
        at org.apache.hadoop.hdds.scm.net.NetworkTopologyImpl.checkAffinityNode(NetworkTopologyImpl.java:789)
        at org.apache.hadoop.hdds.scm.net.NetworkTopologyImpl.chooseRandom(NetworkTopologyImpl.java:399)
        at org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseNode(SCMContainerPlacementRackAware.java:249)
        at org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseDatanodes(SCMContainerPlacementRackAware.java:173)
        at org.apache.hadoop.hdds.scm.container.ReplicationManager.handleUnderReplicatedContainer(ReplicationManager.java:515)
        at org.apache.hadoop.hdds.scm.container.ReplicationManager.processContainer(ReplicationManager.java:311)
        at java.util.concurrent.ConcurrentHashMap$KeySetView.forEach(ConcurrentHashMap.java:4649)
        at java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1080)
        at org.apache.hadoop.hdds.scm.container.ReplicationManager.run(ReplicationManager.java:223)
        at java.lang.Thread.run(Thread.java:745)
2020-01-30 08:16:04,730 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1: java.lang.IllegalArgumentException: Affinity node /dc1/rack1/b9343ca0-a4bc-4436-9671-bc1de6c8bd89 is not a member of topology
2020-01-30 08:16:04,734 INFO org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter: SHUTDOWN_MSG:
{noformat}


