[ 
https://issues.apache.org/jira/browse/HDDS-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880730#comment-16880730
 ] 

Xiaoyu Yao commented on HDDS-1713:
----------------------------------

Attaching the error stack for reference.

{code}

2019-06-18 18:35:10,455 INFO org.apache.hadoop.hdds.scm.container.ReplicationManager: Replication Monitor Thread took 0 milliseconds for processing 36 containers.
2019-06-18 18:35:11,711 INFO org.apache.hadoop.hdds.scm.container.CloseContainerEventHandler: Close container Event triggered for container : #31
2019-06-18 18:35:11,711 INFO org.apache.hadoop.hdds.scm.container.CloseContainerEventHandler: Close container Event triggered for container : #31
2019-06-18 18:35:11,713 INFO org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Moving container #25 to CLOSED state, datanode f6f0df20-a218-4b3c-aa5c-65243ab6f7e4{ip: 10.17.248.13, host: ve1303.halxg.cloudera.com, networkLocation: /default-rack, certSerialId: null} reported CLOSED replica.
2019-06-18 18:35:11,713 INFO org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Moving container #26 to CLOSED state, datanode f6f0df20-a218-4b3c-aa5c-65243ab6f7e4{ip: 10.17.248.13, host: ve1303.halxg.cloudera.com, networkLocation: /default-rack, certSerialId: null} reported CLOSED replica.
2019-06-18 18:35:11,713 INFO org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Moving container #27 to CLOSED state, datanode f6f0df20-a218-4b3c-aa5c-65243ab6f7e4{ip: 10.17.248.13, host: ve1303.halxg.cloudera.com, networkLocation: /default-rack, certSerialId: null} reported CLOSED replica.
2019-06-18 18:35:12,713 INFO org.apache.hadoop.hdds.scm.container.CloseContainerEventHandler: Close container Event triggered for container : #31
2019-06-18 18:35:13,456 INFO org.apache.hadoop.hdds.scm.container.ReplicationManager: Sending close container command for container #21 to datanode b24e0df1-02c3-45d4-8279-764da0b87568{ip: 10.17.248.12, host: ve1302.halxg.cloudera.com, networkLocation: /default-rack, certSerialId: null}.
2019-06-18 18:35:13,456 INFO org.apache.hadoop.hdds.scm.container.ReplicationManager: Sending close container command for container #25 to datanode 3b23a6e2-f2c5-4320-83a6-8baff70d7217{ip: 10.17.187.42, host: va1032.halxg.cloudera.com, networkLocation: /default-rack, certSerialId: null}.
2019-06-18 18:35:13,456 INFO org.apache.hadoop.hdds.scm.container.ReplicationManager: Sending close container command for container #25 to datanode b24e0df1-02c3-45d4-8279-764da0b87568{ip: 10.17.248.12, host: ve1302.halxg.cloudera.com, networkLocation: /default-rack, certSerialId: null}.
2019-06-18 18:35:13,460 ERROR org.apache.hadoop.hdds.scm.container.ReplicationManager: Exception in Replication Monitor Thread.
java.lang.IllegalArgumentException: Affinity node /default-rack/ is not a member of topology
        at org.apache.hadoop.hdds.scm.net.NetworkTopologyImpl.checkAffinityNode(NetworkTopologyImpl.java:780)
        at org.apache.hadoop.hdds.scm.net.NetworkTopologyImpl.chooseRandom(NetworkTopologyImpl.java:408)
        at org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseNode(SCMContainerPlacementRackAware.java:242)
        at org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseDatanodes(SCMContainerPlacementRackAware.java:168)
        at org.apache.hadoop.hdds.scm.container.ReplicationManager.handleUnderReplicatedContainer(ReplicationManager.java:487)
        at org.apache.hadoop.hdds.scm.container.ReplicationManager.processContainer(ReplicationManager.java:293)
        at java.util.concurrent.ConcurrentHashMap$KeySetView.forEach(ConcurrentHashMap.java:4649)
        at java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1080)
        at org.apache.hadoop.hdds.scm.container.ReplicationManager.run(ReplicationManager.java:205)
        at java.lang.Thread.run(Thread.java:748)
2019-06-18 18:35:13,462 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1: java.lang.IllegalArgumentException: Affinity node /default-rack/ is not a member of topology
2019-06-18 18:35:13,463 INFO org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter: SHUTDOWN_MSG:

{code}

> ReplicationManager fail to find proper node topology based on Datanode 
> details from heartbeat
> ---------------------------------------------------------------------------------------------
>
>                 Key: HDDS-1713
>                 URL: https://issues.apache.org/jira/browse/HDDS-1713
>             Project: Hadoop Distributed Data Store
>          Issue Type: Sub-task
>            Reporter: Xiaoyu Yao
>            Assignee: Xiaoyu Yao
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> A datanode (DN) does not include topology information in the heartbeat
> messages that carry its container report or pipeline report.
> SCM is where the topology information is available. While processing a
> heartbeat, we should not rely on the DatanodeDetails from the report to choose
> datanodes for closing containers. Otherwise, the locations of all datanodes
> holding existing container replicas will fall back to /default-rack.
>  
> The fix is to retrieve the corresponding datanode locations from the SCM
> NodeManager, which has the authoritative network topology information.
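
For illustration only, here is a minimal sketch of the idea behind the fix: look up the datanode by UUID in an SCM-side registry and use the location stored there instead of the one carried in the report. Everything in it is an assumption made for the example (the class names TopologyLookupSketch, NodeDetails and NodeRegistry, the resolve method, and the rack name /rack1); it is not the actual HDDS API or the actual patch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class TopologyLookupSketch {
    static final String DEFAULT_RACK = "/default-rack";

    /** Hypothetical stand-in for the datanode details carried in a report. */
    static final class NodeDetails {
        final UUID uuid;
        final String networkLocation;

        NodeDetails(UUID uuid, String networkLocation) {
            this.uuid = uuid;
            this.networkLocation = networkLocation;
        }
    }

    /** Hypothetical stand-in for the SCM-side registry of known datanodes. */
    static final class NodeRegistry {
        private final Map<UUID, NodeDetails> nodes = new HashMap<>();

        void register(NodeDetails node) {
            nodes.put(node.uuid, node);
        }

        NodeDetails getByUuid(UUID uuid) {
            return nodes.get(uuid);
        }
    }

    /**
     * Returns the registry's copy of the datanode, which carries the
     * authoritative network location. Falls back to the reported details
     * only when the registry does not know the datanode.
     */
    static NodeDetails resolve(NodeRegistry registry, NodeDetails reported) {
        NodeDetails known = registry.getByUuid(reported.uuid);
        return known != null ? known : reported;
    }

    public static void main(String[] args) {
        UUID id = UUID.fromString("f6f0df20-a218-4b3c-aa5c-65243ab6f7e4");

        // The registry knows the real rack (hypothetical value).
        NodeRegistry registry = new NodeRegistry();
        registry.register(new NodeDetails(id, "/rack1"));

        // The report only carries the default location.
        NodeDetails fromReport = new NodeDetails(id, DEFAULT_RACK);

        System.out.println(resolve(registry, fromReport).networkLocation);
    }
}
```

Falling back to the reported details for an unknown datanode is just one possible choice for the sketch; a real implementation might prefer to skip such a replica or raise an error.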



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
