[
https://issues.apache.org/jira/browse/HDDS-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880730#comment-16880730
]
Xiaoyu Yao commented on HDDS-1713:
----------------------------------
Attach the error stack for reference.
{code}
2019-06-18 18:35:10,455 INFO
org.apache.hadoop.hdds.scm.container.ReplicationManager: Replication Monitor
Thread took 0 milliseconds for processing 36 containers.
2019-06-18 18:35:11,711 INFO
org.apache.hadoop.hdds.scm.container.CloseContainerEventHandler: Close
container Event triggered for container : #31
2019-06-18 18:35:11,711 INFO
org.apache.hadoop.hdds.scm.container.CloseContainerEventHandler: Close
container Event triggered for container : #31
2019-06-18 18:35:11,713 INFO
org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Moving container
#25 to CLOSED state, datanode f6f0df20-a218-4b3c-aa5c-65243ab6f7e4{ip:
10.17.248.13, host:
[ve1303.halxg.cloudera.com|http://ve1303.halxg.cloudera.com/], networkLocation:
/default-rack, certSerialId: null} reported CLOSED replica.
2019-06-18 18:35:11,713 INFO
org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Moving container
#26 to CLOSED state, datanode f6f0df20-a218-4b3c-aa5c-65243ab6f7e4{ip:
10.17.248.13, host:
[ve1303.halxg.cloudera.com|http://ve1303.halxg.cloudera.com/], networkLocation:
/default-rack, certSerialId: null} reported CLOSED replica.
2019-06-18 18:35:11,713 INFO
org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Moving container
#27 to CLOSED state, datanode f6f0df20-a218-4b3c-aa5c-65243ab6f7e4{ip:
10.17.248.13, host:
[ve1303.halxg.cloudera.com|http://ve1303.halxg.cloudera.com/], networkLocation:
/default-rack, certSerialId: null} reported CLOSED replica.
2019-06-18 18:35:12,713 INFO
org.apache.hadoop.hdds.scm.container.CloseContainerEventHandler: Close
container Event triggered for container : #31
2019-06-18 18:35:13,456 INFO
org.apache.hadoop.hdds.scm.container.ReplicationManager: Sending close
container command for container #21 to datanode
b24e0df1-02c3-45d4-8279-764da0b87568{ip: 10.17.248.12, host:
[ve1302.halxg.cloudera.com|http://ve1302.halxg.cloudera.com/], networkLocation:
/default-rack, certSerialId: null}.
2019-06-18 18:35:13,456 INFO
org.apache.hadoop.hdds.scm.container.ReplicationManager: Sending close
container command for container #25 to datanode
3b23a6e2-f2c5-4320-83a6-8baff70d7217{ip: 10.17.187.42, host:
[va1032.halxg.cloudera.com|http://va1032.halxg.cloudera.com/], networkLocation:
/default-rack, certSerialId: null}.
2019-06-18 18:35:13,456 INFO
org.apache.hadoop.hdds.scm.container.ReplicationManager: Sending close
container command for container #25 to datanode
b24e0df1-02c3-45d4-8279-764da0b87568{ip: 10.17.248.12, host:
[ve1302.halxg.cloudera.com|http://ve1302.halxg.cloudera.com/], networkLocation:
/default-rack, certSerialId: null}.
2019-06-18 18:35:13,460 ERROR
org.apache.hadoop.hdds.scm.container.ReplicationManager: Exception in
Replication Monitor Thread.
java.lang.IllegalArgumentException: Affinity node /default-rack/ is not a
member of topology
at
org.apache.hadoop.hdds.scm.net.NetworkTopologyImpl.checkAffinityNode(NetworkTopologyImpl.java:780)
at
org.apache.hadoop.hdds.scm.net.NetworkTopologyImpl.chooseRandom(NetworkTopologyImpl.java:408)
at
org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseNode(SCMContainerPlacementRackAware.java:242)
at
org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseDatanodes(SCMContainerPlacementRackAware.java:168)
at
org.apache.hadoop.hdds.scm.container.ReplicationManager.handleUnderReplicatedContainer(ReplicationManager.java:487)
at
org.apache.hadoop.hdds.scm.container.ReplicationManager.processContainer(ReplicationManager.java:293)
at
java.util.concurrent.ConcurrentHashMap$KeySetView.forEach(ConcurrentHashMap.java:4649)
at
java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1080)
at
org.apache.hadoop.hdds.scm.container.ReplicationManager.run(ReplicationManager.java:205)
at java.lang.Thread.run(Thread.java:748)
2019-06-18 18:35:13,462 INFO org.apache.hadoop.util.ExitUtil: Exiting with
status 1: java.lang.IllegalArgumentException: Affinity node /default-rack/ is
not a member of topology
2019-06-18 18:35:13,463 INFO
org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter: SHUTDOWN_MSG:
{code}
> ReplicationManager fail to find proper node topology based on Datanode
> details from heartbeat
> ---------------------------------------------------------------------------------------------
>
> Key: HDDS-1713
> URL: https://issues.apache.org/jira/browse/HDDS-1713
> Project: Hadoop Distributed Data Store
> Issue Type: Sub-task
> Reporter: Xiaoyu Yao
> Assignee: Xiaoyu Yao
> Priority: Major
> Labels: pull-request-available
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> DN does not have the topology info included in its heartbeat message for
> container report/pipeline report.
> SCM is where the topology information is available. During the processing of
> heartbeat, we should not rely on the datanodedetails from report to choose
> datanodes for close container. Otherwise, all the datanode locations of
> existing container replicas will fallback to /default-rack.
>
> The fix is to retrieve the corresponding datanode locations from scm
> nodemanager, which has authoritative network topology information.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]