[
https://issues.apache.org/jira/browse/HDDS-3920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sammi Chen resolved HDDS-3920.
------------------------------
Resolution: Fixed
> Too many redundant replications due to failure to get node's ancestor in
> ReplicationManager
> ---------------------------------------------------------------------------------------
>
> Key: HDDS-3920
> URL: https://issues.apache.org/jira/browse/HDDS-3920
> Project: Hadoop Distributed Data Store
> Issue Type: Bug
> Reporter: Sammi Chen
> Assignee: Sammi Chen
> Priority: Blocker
> Labels: pull-request-available
> Attachments: over-replicated-container-list.txt
>
>
> In our production cluster, we turned on the network topology configuration.
> Because ReplicationManager fails to get the node's ancestor (the datanode
> object it uses doesn't have its parent correctly set) during the
> under-replication and over-replication checks, it concludes that the
> container's replicas don't meet the requirement of spanning more than one
> rack. It then treats the container as under-replicated even though it already
> has many replicas, and sends commands to datanodes to replicate the container
> again and again.
> 2020-07-03 16:26:45,200 [ReplicationMonitor] INFO
> org.apache.hadoop.hdds.scm.container.ReplicationManager: Container #105228 is
> over replicated. Expected replica count is 3, but found 31.
> 2020-07-03 10:48:00,161 [ReplicationMonitor] DEBUG
> org.apache.hadoop.hdds.scm.container.ReplicationManager: Handling
> underreplicated container: 210413
> 2020-07-03 10:48:00,161 [ReplicationMonitor] DEBUG
> org.apache.hadoop.hdds.scm.container.ReplicationManager: deletionInFlight of
> container {}#210413
> 2020-07-03 10:48:00,161 [ReplicationMonitor] DEBUG
> org.apache.hadoop.hdds.scm.container.ReplicationManager: replicationInFlight
> of container {}#210413
> 2020-07-03 10:48:00,161 [ReplicationMonitor] DEBUG
> org.apache.hadoop.hdds.scm.container.ReplicationManager: 9.180.20.43
> 2020-07-03 10:48:00,161 [ReplicationMonitor] DEBUG
> org.apache.hadoop.hdds.scm.container.ReplicationManager: source of container
> {}#210413
> 2020-07-03 10:48:00,161 [ReplicationMonitor] DEBUG
> org.apache.hadoop.hdds.scm.container.ReplicationManager: 9.180.5.41
> 2020-07-03 10:48:00,161 [ReplicationMonitor] DEBUG
> org.apache.hadoop.hdds.scm.container.ReplicationManager: 9.179.142.251
> 2020-07-03 10:48:00,161 [ReplicationMonitor] DEBUG
> org.apache.hadoop.hdds.scm.container.ReplicationManager: 9.180.8.85
> 2020-07-03 10:48:00,161 [ReplicationMonitor] DEBUG
> org.apache.hadoop.hdds.scm.container.ReplicationManager: 9.179.142.250
> 2020-07-03 10:48:00,161 [ReplicationMonitor] DEBUG
> org.apache.hadoop.hdds.scm.container.ReplicationManager: 9.180.8.35
> 2020-07-03 10:48:00,161 [ReplicationMonitor] DEBUG
> org.apache.hadoop.hdds.scm.container.ReplicationManager: 9.180.8.67
> 2020-07-03 10:48:00,161 [ReplicationMonitor] DEBUG
> org.apache.hadoop.hdds.scm.container.ReplicationManager: 9.179.142.135
> 2020-07-03 10:48:00,161 [ReplicationMonitor] DEBUG
> org.apache.hadoop.hdds.scm.container.ReplicationManager: 9.179.144.104
> 2020-07-03 10:48:00,161 [ReplicationMonitor] DEBUG
> org.apache.hadoop.hdds.scm.container.ReplicationManager: 9.180.20.58
> 2020-07-03 10:48:00,161 [ReplicationMonitor] DEBUG
> org.apache.hadoop.hdds.scm.container.ReplicationManager: 9.179.142.198
> 2020-07-03 10:48:00,161 [ReplicationMonitor] DEBUG
> org.apache.hadoop.hdds.scm.container.ReplicationManager: 9.180.20.222
> 2020-07-03 10:48:00,161 [ReplicationMonitor] WARN
> org.apache.hadoop.hdds.scm.container.ReplicationManager: Process container
> #210413 error:
> java.lang.IllegalArgumentException
> at
> com.google.common.base.Preconditions.checkArgument(Preconditions.java:128)
> at
> org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseDatanodes(SCMContainerPlacementRackAware.java:101)
> at
> org.apache.hadoop.hdds.scm.container.ReplicationManager.handleUnderReplicatedContainer(ReplicationManager.java:568)
> at
> org.apache.hadoop.hdds.scm.container.ReplicationManager.processContainer(ReplicationManager.java:331)
> 2020-07-03 10:48:00,161 [ReplicationMonitor] WARN
> org.apache.hadoop.hdds.scm.net.NetUtils: Fail to get ancestor generation 1 of
> node :f8d9ccf6-20c6-4dfa-8a49-012f43a1b27e{ip: 9.179.142.251, host: host251,
> networkLocation: /rack3, certSerialId: null}
> 2020-07-03 10:48:00,161 [ReplicationMonitor] WARN
> org.apache.hadoop.hdds.scm.net.NetUtils: Fail to get ancestor generation 1 of
> node :826dda09-1259-4c5c-9a80-56b985665dc4{ip: 9.180.6.157, host:
> host-9-180-6-157, networkLocation: /rack10, certSerialId: null}
> 2020-07-03 10:48:00,161 [ReplicationMonitor] WARN
> org.apache.hadoop.hdds.scm.net.NetUtils: Fail to get ancestor generation 1 of
> node :b85962f2-6647-463b-9944-3c9b24e4e313{ip: 9.180.19.148, host:
> host-9-180-19-148, networkLocation: /rack3, certSerialId: null}
> 2020-07-03 10:48:00,161 [ReplicationMonitor] WARN
> org.apache.hadoop.hdds.scm.net.NetUtils: Fail to get ancestor generation 1 of
> node :039cb21e-4e2e-47e2-bf3e-b025319ee856{ip: 9.179.142.158, host: host158,
> networkLocation: /rack1, certSerialId: null}
> 2020-07-03 10:48:00,161 [ReplicationMonitor] WARN
> org.apache.hadoop.hdds.scm.net.InnerNodeImpl: Ancestor not found, node:
> /rack1/33b49c34-caa2-4b4f-894e-dce7db4f97b9, generation to exclude: 1,
> generation to return: 1
> 2020-07-03 10:48:00,161 [ReplicationMonitor] WARN
> org.apache.hadoop.hdds.scm.net.InnerNodeImpl: Ancestor not found, node:
> /rack3/b1e555d4-7114-4b80-b425-93086b0f2036, generation to exclude: 1,
> generation to return: 1
> 2020-07-03 10:48:00,161 [ReplicationMonitor] WARN
> org.apache.hadoop.hdds.scm.net.InnerNodeImpl: Ancestor not found, node:
> /rack1/55148789-0cdb-4631-a3b3-c1da774523aa, generation to exclude: 1,
> generation to return: 1
> 2020-07-03 10:48:00,161 [ReplicationMonitor] WARN
> org.apache.hadoop.hdds.scm.net.InnerNodeImpl: Ancestor not found, node:
> /rack3/32e8d855-b702-438d-b829-ac43dc567afc, generation to exclude: 1,
> generation to return: 1
> 2020-07-03 10:48:00,161 [ReplicationMonitor] WARN
> org.apache.hadoop.hdds.scm.net.InnerNodeImpl: Ancestor not found, node:
> /rack2/2e1b2fdd-f8fb-4252-bfc1-31d5339681be, generation to exclude: 1,
> generation to return: 1
> 2020-07-03 10:48:00,161 [ReplicationMonitor] WARN
> org.apache.hadoop.hdds.scm.net.InnerNodeImpl: Ancestor not found, node:
> /rack3/db854037-4846-4093-89de-e492e0f14239, generation to exclude: 1,
> generation to return: 1
> 2020-07-03 10:48:00,161 [ReplicationMonitor] WARN
> org.apache.hadoop.hdds.scm.net.InnerNodeImpl: Ancestor not found, node:
> /rack3/f8d9ccf6-20c6-4dfa-8a49-012f43a1b27e, generation to exclude: 1,
> generation to return: 1
> 2020-07-03 10:48:00,161 [ReplicationMonitor] WARN
> org.apache.hadoop.hdds.scm.net.InnerNodeImpl: Ancestor not found, node:
> /rack10/826dda09-1259-4c5c-9a80-56b985665dc4, generation to exclude: 1,
> generation to return: 1
> 2020-07-03 10:48:00,161 [ReplicationMonitor] WARN
> org.apache.hadoop.hdds.scm.net.InnerNodeImpl: Ancestor not found, node:
> /rack3/b85962f2-6647-463b-9944-3c9b24e4e313, generation to exclude: 1,
> generation to return: 1
> 2020-07-03 10:48:00,161 [ReplicationMonitor] WARN
> org.apache.hadoop.hdds.scm.net.InnerNodeImpl: Ancestor not found, node:
> /rack1/039cb21e-4e2e-47e2-bf3e-b025319ee856, generation to exclude: 1,
> generation to return: 1
> 2020-07-03 10:48:00,161 [ReplicationMonitor] INFO
> org.apache.hadoop.hdds.scm.container.ReplicationManager: Container: #210419.
> The container is mis-replicated as it is on 1 racks but should be on 2 racks.
> 2020-07-03 10:48:00,161 [ReplicationMonitor] INFO
> org.apache.hadoop.hdds.scm.container.ReplicationManager: Sending replicate
> container command for container #210419 to datanode
> 5cb315e9-7326-4592-8dd6-21f4342b09c1{ip: 9.180.8.85, host: host-9-180-8-85,
> networkLocation: /rack10, certSerialId: null}
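
The failure mode described in the issue can be illustrated with a small, self-contained sketch. It is not the actual ReplicationManager or placement-policy code: the class `RackCountSketch`, its `Node` type, and the methods `countRacks`, `countRacksByLocation`, and `isMisReplicated` are hypothetical names invented for this illustration. The assumption is that every failed ancestor lookup yields `null`, so all replicas appear to share one "rack" even though their `networkLocation` values differ, which matches the "on 1 racks but should be on 2 racks" message in the log above.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified, hypothetical stand-ins; these are not the real Ozone classes.
final class RackCountSketch {

  /** A datanode replica; parentRack == null models "parent not set". */
  static final class Node {
    final String networkLocation; // e.g. "/rack3", still present on the node
    final String parentRack;      // result of the ancestor lookup, may be null

    Node(String networkLocation, String parentRack) {
      this.networkLocation = networkLocation;
      this.parentRack = parentRack;
    }
  }

  /** Counts distinct rack ancestors; every failed lookup collapses to null. */
  static int countRacks(List<Node> replicas) {
    Set<String> racks = new HashSet<>(); // HashSet holds at most one null entry
    for (Node node : replicas) {
      racks.add(node.parentRack);
    }
    return racks.size();
  }

  /** Counts racks from the location string, which does not need a parent link. */
  static int countRacksByLocation(List<Node> replicas) {
    Set<String> racks = new HashSet<>();
    for (Node node : replicas) {
      racks.add(node.networkLocation);
    }
    return racks.size();
  }

  /** A container is treated as mis-replicated when it spans too few racks. */
  static boolean isMisReplicated(int racks, int requiredRacks) {
    return racks < requiredRacks;
  }

  public static void main(String[] args) {
    // Three replicas on three different racks, but with no parent set.
    List<Node> replicas = List.of(
        new Node("/rack3", null),
        new Node("/rack10", null),
        new Node("/rack1", null));

    int seen = countRacks(replicas);             // 1: all lookups returned null
    int actual = countRacksByLocation(replicas); // 3: the real rack spread

    System.out.println("racks seen via ancestor lookup: " + seen);
    System.out.println("racks by network location:      " + actual);
    System.out.println("mis-replicated (needs 2 racks): " + isMisReplicated(seen, 2));
  }
}
```

In this model the check keeps reporting the container as mis-replicated no matter how many replicas are added, because each new replica also resolves to a `null` ancestor. That is one plausible reading of why the replica count in the log grew to 31 against an expected 3.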